Affiliation Department | Department of Computer Science
Title | Associate Professor
Other name(s) | Sako Shinji
Contact information |
Homepage |
External Link | SAKO Shinji
Research Interests
-
Music Signal Processing
-
Music Information Processing
-
Sign Language Recognition
-
Singing Voice Synthesis
-
Speech Synthesis
Research Areas
-
Life Science / Rehabilitation science
-
Informatics / Kansei informatics
-
Informatics / Perceptual information processing
From School
-
Nagoya Institute of Technology Faculty of Engineering Department of Intelligent Information Systems Graduated
1995.04 - 1999.03
Country:Japan
From Graduate School
-
Nagoya Institute of Technology Graduate School, Division of Engineering Department of Electrical & Computer Engineering Doctor's Course Completed
2001.04 - 2004.03
Country:Japan
External Career
-
Advanced Telecommunications Research Institute International
2003.04 - 2003.06
Country:Japan
-
The University of Tokyo Graduate School of Information Science and Technology Research Assistant
2004.04 - 2007.03
Country:Japan
-
AGH University of Science and Technology Faculty of Computer Science, Electronics and Telecommunications Guest Scientist
2014.07 - 2014.08
Country:Poland
-
Technical University of Munich Institute for Human-Machine Communication Guest Scientist
2012.06 - 2012.12
Country:Germany
-
Technical University of Munich Institute for Human-Machine Communication JSPS Scientist for Joint International Research
2016.07 - 2017.03
Country:Japan
Professional Memberships
-
Japanese Association of Sign Linguistics
2010.06
-
Human Interface Society
2010.06
-
Executive Committee, Tokai-Section Joint Conference on Electrical and Related Engineering
2009.04 - 2009.12
-
Advanced Language Information Forum (ALAGIN)
2008.07
-
The Institute of Image Information and Television Engineers
2007.10
Qualification Acquired
-
Software Design & Development Engineer/Information Processing Engineer, Class 1
Papers
-
3D Ego-Pose Lift-Up Robustness Study for Fisheye Camera Perturbations Reviewed International journal
Teppei Miura, Shinji Sako, Tsutomu Kimura
Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications 4 600 - 606 2023.02
Language:English Publishing type:Research paper (international conference proceedings)
3D egocentric human pose estimation from a mounted fisheye camera has been developed following advances in convolutional neural networks and synthetic data generation. The camera captures images that are affected by the optical properties of the lens, the mounting position, and the camera perturbations caused by body motion. Therefore, data collection and model training are the main challenges in estimating the 3D ego-pose from a mounted fisheye camera. Previous works proposed synthetic data generation and a two-step estimation model, consisting of 2D human pose estimation and a subsequent 3D lift-up, to overcome these tasks. However, these works insufficiently verify robustness against camera perturbations. In this paper, we evaluate existing models for robustness using a synthetic dataset in which the camera perturbations increase in several steps. Our study provides useful knowledge for introducing 3D ego-pose estimation from a mounted fisheye camera in practice.
-
Visualization of Affective Information in Music Using Chironomie Reviewed International journal
2022.09
Authorship:Last author Language:English Publishing type:Research paper (international conference proceedings)
-
Simple yet effective 3D ego-pose lift-up based on vector and distance for a mounted omnidirectional camera Reviewed International journal
Teppei Miura, Shinji Sako
Applied Intelligence 2022.05
Language:English Publishing type:Research paper (scientific journal) Publisher:Springer
Following the advances in convolutional neural networks and synthetic data generation, 3D egocentric body pose estimations from a mounted fisheye camera have been developed. Previous works estimated 3D joint positions from raw image pixels and intermediate supervision during the process. The mounted fisheye camera captures notably different images that are affected by the optical properties of the lens, angle of views, and setup positions. Therefore, 3D ego-pose estimation from a mounted fisheye camera must be trained for each set of camera optics and setup. We propose a 3D ego-pose estimation from a single mounted omnidirectional camera that captures the entire circumference by back-to-back dual fisheye cameras. The omnidirectional camera can capture the user’s body in the 360° field of view under a wide variety of motions. We also propose a simple feed-forward network model to estimate 3D joint positions from 2D joint locations. The lift-up model can be used in real time yet obtains accuracy comparable to those of previous works on our new dataset. Moreover, our model is trainable with the ground truth 3D joint positions and the unit vectors toward the 3D joint positions, which are easily generated from existing publicly available 3D mocap datasets. This advantage alleviates the data collection and training burden due to changes in the camera optics and setups, although it is limited to the effect after the 2D joint location estimation.
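For illustration, the following is a minimal sketch of the lift-up idea described above. It is not the published model; the joint count and layer widths are assumptions. A small feed-forward network maps 2D joint locations to a unit direction vector and a distance per joint, from which 3D positions are reconstructed.

# Minimal sketch (not the published model): a feed-forward "lift-up" network
# that maps 2D joint locations to a unit direction vector and a distance per
# joint, then reconstructs 3D joint positions. Joint count and layer widths
# are illustrative assumptions.
import torch
import torch.nn as nn

N_JOINTS = 15  # assumed number of body joints

class LiftUpNet(nn.Module):
    def __init__(self, n_joints=N_JOINTS, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(n_joints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.vec_head = nn.Linear(hidden, n_joints * 3)   # direction toward each joint
        self.dist_head = nn.Linear(hidden, n_joints)      # distance to each joint

    def forward(self, joints_2d):                          # joints_2d: (B, n_joints, 2)
        h = self.backbone(joints_2d.flatten(1))
        vec = self.vec_head(h).view(-1, N_JOINTS, 3)
        vec = vec / (vec.norm(dim=-1, keepdim=True) + 1e-8)   # unit vectors
        dist = self.dist_head(h).unsqueeze(-1)                # (B, n_joints, 1)
        return vec * dist                                     # 3D positions

model = LiftUpNet()
pred_3d = model(torch.randn(8, N_JOINTS, 2))   # batch of 8 poses
print(pred_3d.shape)                           # torch.Size([8, 15, 3])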
-
3D skeleton motion generation of double bass from musical score Reviewed International journal
Takeru Shirai, Shinji Sako
15th International Symposium on Computer Music Multidisciplinary Research (CMMR) 41 - 46 2021.11
Language:English Publishing type:Research paper (international conference proceedings)
In this study, we propose a method for generating 3D skeleton motions of a double bass player from musical score information using a 2-layer LSTM network. Since there is no suitable dataset for this study, we created a new motion dataset from actual double bass performances. The contribution of this paper is to show the effect of combining bowing and fingering information in the generation of performance motion, and to examine effective model structures for performance generation. Both objective and subjective evaluations showed that the accuracy of generated performance motion for double bass can be improved by using two types of additional information (bowing and fingering) and by constructing a model that takes bowing and fingering into account.
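As an illustration of the model described above, the following is a minimal sketch, not the authors' code: a 2-layer LSTM that maps frame-wise score features, concatenated with bowing and fingering information, to 3D joint coordinates. All feature dimensions are assumptions.

# Minimal sketch (assumed dimensions, not the authors' implementation): a
# 2-layer LSTM mapping frame-wise score features plus bowing and fingering
# information to 3D skeleton coordinates.
import torch
import torch.nn as nn

SCORE_DIM, BOW_DIM, FINGER_DIM = 32, 4, 8   # assumed feature sizes
N_JOINTS = 15                               # assumed joint count

class ScoreToMotion(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        in_dim = SCORE_DIM + BOW_DIM + FINGER_DIM
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, N_JOINTS * 3)

    def forward(self, score, bowing, fingering):
        x = torch.cat([score, bowing, fingering], dim=-1)   # (B, T, in_dim)
        h, _ = self.lstm(x)
        return self.out(h).view(x.size(0), x.size(1), N_JOINTS, 3)

net = ScoreToMotion()
motion = net(torch.randn(2, 300, SCORE_DIM),
             torch.randn(2, 300, BOW_DIM),
             torch.randn(2, 300, FINGER_DIM))
print(motion.shape)   # torch.Size([2, 300, 15, 3])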
-
SynSLaG: Synthetic Sign Language Generator Reviewed International journal
Teppei Miura, Shinji Sako
ASSETS '21: The 23rd International ACM SIGACCESS Conference on Computers and Accessibility ( 90 ) 1 - 4 2021.10
Language:English Publishing type:Research paper (international conference proceedings) Publisher:Association for Computing Machinery
Machine learning techniques have the potential to play an important role in sign language recognition. However, sign language datasets lack the volume and variety necessary to work well. To enlarge these datasets, we introduce SynSLaG, a tool that synthetically generates sign language datasets from 3D motion capture data. SynSLaG generates realistic images of various body shapes with ground truth 2D/3D poses, depth maps, body-part segmentations, optical flows, and surface normals. The large synthetic datasets provide possibilities for advancing sign language recognition and analysis.
-
Recognition of JSL fingerspelling using Deep Convolutional Neural Networks Reviewed International journal
Bogdan Kwolek, Wojciech Baczynski, Shinji Sako
Neurocomputing 2021.06
Language:English Publishing type:Research paper (scientific journal)
In this paper, we present an approach for the recognition of static fingerspelling in Japanese Sign Language on RGB images. Two 3D articulated hand models were developed to generate synthetic fingerspellings and to extend a dataset consisting of real hand gestures. In the first approach, advanced graphics techniques were employed to rasterize photorealistic gestures using a skinned hand model. In the second approach, gestures rendered using simpler lighting techniques were post-processed by a modified Generative Adversarial Network. In order to avoid the generation of unrealistic fingerspellings, a hand segmentation term was added to the loss function of the GAN. The segmentation of the hand in images with complex backgrounds was done by the proposed ResNet34-based segmentation network. The fingerspelled signs were recognized by an ensemble of both fine-tuned and trained-from-scratch neural networks. Experimental results demonstrate that, owing to a sufficient amount of training data, a high recognition rate can be attained on RGB images. The JSL dataset with pixel-level hand segmentations is available for download.
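The segmentation term mentioned above can be illustrated with a small, purely hypothetical sketch: the generator loss combines an adversarial term with a hand-segmentation consistency term. The networks, shapes, and the weight lambda are placeholders, not the published architecture.

# Illustrative sketch only: adding a hand-segmentation consistency term to a
# GAN generator loss. The inputs and the weight lambda are placeholders.
import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits, fake_seg_pred, seg_target, lam=10.0):
    """Adversarial term plus a segmentation term penalizing unrealistic hands.

    d_fake_logits : discriminator logits for generated images
    fake_seg_pred : segmentation-network logits on the generated images
    seg_target    : hand masks of the source renderings
    """
    adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    seg = F.binary_cross_entropy_with_logits(fake_seg_pred, seg_target)
    return adv + lam * seg

# toy shapes: batch of 4 images, 64x64 masks
loss = generator_loss(torch.randn(4, 1),
                      torch.randn(4, 1, 64, 64),
                      torch.rand(4, 1, 64, 64))
print(float(loss))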
-
Fingerspelling recognition using synthetic images and deep transfer learning Reviewed
Nguyen Tu Nam, Shinji Sako, Bogdan Kwolek
2020 The 13th International Conference on Machine Vision (ICMV 2020) 2020.11
Language:English Publishing type:Research paper (international conference proceedings)
Although gesture recognition has been intensely studied for decades, it is still a challenging research topic due to difficulties posed by background complexity, occlusion, viewpoint, lighting changes, the deformable and articulated nature of hands, etc. Numerous studies have shown that extending a training dataset of real images with synthetic images improves recognition accuracy. However, little work has been devoted to demonstrating what improvements in recognition can be achieved by transferring the style of real gestures onto synthetically generated images. In this paper, we propose a novel method for Japanese fingerspelling recognition using both real images and synthetic images generated on the basis of a 3D hand model. We propose to employ neural style transfer to include information from real images in the synthetically generated dataset. We demonstrate experimentally that neural style transfer and discriminative layer training, applied when training deep neural models, allow considerable gains in recognition accuracy.
-
Study on Effective Combination of Features for Non-word Speech Recognition of Phonological Examination Reviewed
Toshiharu Tadano, Masahiko Nawate, Fumihito Ito, Shinji Sako
IPSJ Journal 61 ( 10 ) 1647 - 1657 2020.10
Language:Japanese Publishing type:Research paper (scientific journal) Publisher:Information Processing Society of Japan
Developmental dyslexia is a major form of learning disability, and its early detection is very important for intervention and reading treatment. A convenient PC-based screening test has been published, in which the answer times for text reading, reversed reading of words, and mora skipping of words are automatically recorded. However, the correctness determination must still be done by the tester. In order to automate these tests, speech recognition technology for non-words, that is, the meaningless words used in the examination, is necessary, but conventional speech recognition has low accuracy for non-words. Therefore, while reinforcing the functions of conventional speech recognition, the accuracy for non-words needs to be improved to a level that is practical for phonological examination. In this study, we tried to improve the accuracy for non-words by incorporating a mechanism to determine non-word correctness into Julius, which is publicly available and can be modified freely. In addition, six candidate speech features are considered, and the trend of accuracy for their combinations is examined. As a result, depending on the target non-word, the accuracy was 75.0% to 95.0%, and the overall average was 87.5%.
-
3D human pose estimation model using location-maps for distorted and disconnected images by a wearable omnidirectional camera Reviewed International journal
Teppei Miura, Shinji Sako
IPSJ Transactions on Computer Vision and Applications 12 ( 4 ) 1 - 17 2020.08
Language:English Publishing type:Research paper (scientific journal) Publisher:Information Processing Society of Japan
We address 3D human pose estimation for equirectangular images taken by a wearable omnidirectional camera. The equirectangular image is distorted because the omnidirectional camera is attached closely in front of a person’s neck. Furthermore, some parts of the body are disconnected in the image; for instance, when a hand goes out at one edge of the image, the hand comes in from another edge. The distortion and disconnection of images make 3D pose estimation challenging. To overcome this difficulty, we introduce the location-maps method proposed by Mehta et al.; however, that method was used to estimate 3D human poses only for regular images without distortion and disconnection. We focus on a characteristic of location-maps: they can extend 2D joint locations to 3D positions with respect to 2D-3D consistency without considering kinematic model restrictions and optical properties. In addition, we collect a new dataset composed of equirectangular images and synchronized 3D joint positions for training and evaluation. We validate the capability of location-maps to estimate 3D human poses for distorted and disconnected images. We propose a new location-maps-based model by replacing the backbone network with a state-of-the-art 2D human pose estimation model (HRNet). Our model has a simpler architecture than the reference model proposed by Mehta et al. Nevertheless, our model shows better performance with respect to accuracy and computational complexity. Finally, we analyze the location-maps method from two perspectives: the map variance and the map scale. This analysis reveals some characteristics of location-maps: (1) the map variance affects the robustness of extending 2D joint locations to 3D positions against 2D estimation errors, and (2) the 3D position accuracy is related to the accuracy of the 2D locations relative to the map scale.
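The location-maps read-out described above can be sketched as follows (illustrative shapes only, not the published model): each joint has a 2D heatmap plus X/Y/Z location-maps, and its 3D position is read from the location-maps at the heatmap maximum.

# Minimal numpy sketch of a location-maps read-out: per joint, take the 2D
# heatmap maximum and read the 3D position from the X/Y/Z location-maps at
# that pixel. Shapes are illustrative.
import numpy as np

def read_locmaps(heatmaps, locmaps):
    """heatmaps: (J, H, W); locmaps: (J, 3, H, W) -> (J, 3) 3D joint positions."""
    J, H, W = heatmaps.shape
    joints_3d = np.zeros((J, 3))
    for j in range(J):
        y, x = np.unravel_index(np.argmax(heatmaps[j]), (H, W))   # 2D location
        joints_3d[j] = locmaps[j, :, y, x]                        # 3D read-out
    return joints_3d

J, H, W = 15, 64, 64
pose = read_locmaps(np.random.rand(J, H, W), np.random.randn(J, 3, H, W))
print(pose.shape)   # (15, 3)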
-
Constructing a Highly Accurate Japanese Sign Language Motion Database Including Dialogue Reviewed International journal
Yuji Nagashima, Keiko Watanabe, Daisuke Hara, Yasuo Horiuchi, Shinji Sako, Akira Ichikawa
International Conference on Human-Computer Interaction 2020 76 - 81 2020.07
Language:English Publishing type:Research paper (international conference proceedings) Publisher:Springer
Books and Other Publications
-
Ritsuko Kikusawa and Noboru Yoshioka (eds.), et al. ( Role: Contributor , Evolution of language recognition devices)
Bunrikaku Publishing 2023.04 ( ISBN:9784892599248 )
-
Speech Communication and People with Disabilities
Akira Ichikawa, Yuji Nagashima, Akira Okamoto, Naoto Kato, Shinji Sako, Tetsuya Takiguchi, Daisuke Hara, Michiru Makuuchi( Role: Joint author , Chapter 2 Speech and Communication Disorders)
Corona Publishing 2021.07 ( ISBN:9784339013429 )
Total pages:242 Language:Japanese Book type:Scholarly book
Misc
-
HMM-based Automatic Sign Language Recognition using Phonemic Structure of Japanese Sign Language
Shinji Sako, Tadashi Kitamura
Journal of the Japan Society for Welfare Engineering 17 ( 2 ) 2 - 7 2015.11
Authorship:Lead author Language:Japanese Publishing type:Article, review, commentary, editorial, etc. (international conference proceedings) Publisher:Japan Society for Welfare Engineering
-
Speech/Sound based Human Interfaces (1) Construction of Speech Synthesis Systems using HTS Reviewed
Keiichiro Oura, Heiga Zen, Shinji Sako, Keiichi Tokuda
Human interface 12 ( 1 ) 35 - 40 2010.02
Language:Japanese Publishing type:Article, review, commentary, editorial, etc. (international conference proceedings) Publisher:Human interface Society
-
Special Issue: Music and OR - Automatic Composition from Japanese Lyrics Invited
嵯峨山 茂樹,中妻 啓,深山 覚,酒向 慎司,西本 卓也
Operations research as a management science 54 ( 9 ) 546 - 553 2009.10
Language:Japanese Publishing type:Article, review, commentary, editorial, etc. (scientific journal) Publisher:The Operations Research Society of Japan
This article describes a method for automatically composing songs based on the prosody of arbitrary Japanese text. An automatic composition system that generates a melody for any Japanese text used directly as lyrics, whether literary works, one's own poems, news, or e-mail, and outputs it as a song, would be useful as a handy composition tool, as a composition aid for people without musical expertise, and also as a way to avoid copyright issues. A song must relate to its lyrics. In particular, since Japanese has pitch accent, spoken utterances rise and fall in pitch, so agreement between the prosody of reading the lyrics aloud and the melody is considered important. Focusing on this point, the authors formulated melody design as a dynamic programming problem: with the harmony, rhythm, and accompaniment figures selected by the user as constraints, the melody is treated as a path transitioning between pitches, and the optimal path is searched for under constraints on the up-and-down motion of the prosody. Based on this model, we developed the automatic composition system Orpheus, which attaches to arbitrary Japanese lyrics a melody that matches their prosody, and we introduce it here.
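As a toy illustration of the dynamic-programming view described above (simplified placeholders, not Orpheus itself), the following sketch chooses one pitch per syllable so that the melodic up/down motion follows the prosodic contour of the lyrics while small steps are preferred.

# Toy dynamic-programming sketch: choose a pitch per syllable so that melodic
# up/down motion follows the prosodic contour, preferring small steps. The
# pitch set, costs, and constraint are simplified placeholders.
import numpy as np

PITCHES = [60, 62, 64, 65, 67, 69, 71, 72]     # C major scale (MIDI numbers)
prosody = [+1, +1, -1, 0, +1, -1]              # desired contour between syllables

def compose(prosody, mismatch_cost=10.0, leap_cost=1.0):
    P, T = len(PITCHES), len(prosody) + 1
    cost = np.zeros((T, P))
    back = np.zeros((T, P), dtype=int)
    for t in range(1, T):
        for j, pj in enumerate(PITCHES):
            best, arg = np.inf, 0
            for i, pi in enumerate(PITCHES):
                direction = np.sign(pj - pi)
                c = (cost[t - 1, i]
                     + leap_cost * abs(pj - pi)                               # prefer small steps
                     + (mismatch_cost if direction != prosody[t - 1] else 0.0))
                if c < best:
                    best, arg = c, i
            cost[t, j], back[t, j] = best, arg
    # backtrack the optimal pitch path
    path = [int(np.argmin(cost[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [PITCHES[i] for i in reversed(path)]

print(compose(prosody))   # a pitch sequence whose contour follows the prosody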
Presentations
-
Current progress on automatic sign language recognition and translation, and its future prospects Invited
Shinji Sako
IEICE Technical Committee on Communication Systems (CS) 2022.11 Institute of Electronics, Information and Communication Engineers
Event date: 2022.11
Language:Japanese Presentation type:Oral presentation (invited, special)
Venue:Nagoya Institute of Technology Country:Japan
In Japan today, approximately 340,000 people with hearing or speech disabilities are said to hold physical disability certificates. There are various modes of communication for hearing-impaired people, depending on differences in hearing ability, whether the hearing loss is congenital or partial, and so on. One of these is sign language. Sign language is a visual language, a natural language with its own grammatical system. In Japan, a distinct sign language called Japanese Sign Language is used, which has a grammatical system different from that of spoken Japanese. On the other hand, because the number of hearing people who have learned sign language is limited, communication between hearing-impaired and hearing people is usually conducted in writing or orally. Both of these methods cause stress for the hearing-impaired person, the hearing person, or both. The situations in which a sign language interpreter can intervene are limited, and there are also situations in which it is difficult for an interpreter to intervene due to privacy concerns. Against this background, research on the machine reading of signing (sign language recognition and translation) and on the generation of signing (sign language synthesis) has been conducted for many years. In this talk, the author discusses the fundamental characteristics of sign languages, as well as the trends and future prospects of research on sign language recognition and translation technologies.
-
Wearable MoCap System for User's Lifelog with the Surrounding Image
Teppei Miura, Shinji Sako
IEICE HCG Symposium 2021 Institute of Electronics, Information and Communication Engineers
Event date: 2021.12
Language:Japanese Presentation type:Oral presentation (general)
Venue:Online meeting (Zoom)
We propose a wearable motion capture system to collect a user's lifelog together with the surrounding image. We developed a prototype system composed of a wearable omnidirectional camera, a GPU single-board computer, and a 3D pose estimation model. We train and validate the prototype system on synthetic datasets. Additionally, we qualitatively evaluate the 3D pose estimation model on real data recorded in natural environments.
-
3-D motion generation for double bass performance from musical score International conference
Shinji Sako, Takeru Shirai
14th International Workshop on Machine Learning and Music
Event date: 2021.12
Language:English Presentation type:Oral presentation (general)
Venue:Online (Zoom)
We propose a method for generating 3-D motions of a double bass player from a musical score. Generating 3-D motions of a performance is promising for realizing performances by virtual players (avatars) or robots, and can also be useful for performance training for beginners. There have been many studies on generating musical performances, but not many of them generate the human motion of the performance. There are a few previous studies on generating performance motions for piano and violin. In addition, no large dataset containing 3-D movements of performances is available.
In this study, we developed a small 3-D motion dataset of actual double bass performances. PERCEPTION NEURON, an inertial motion capture device, was used to capture the performance movements. 3-D coordinates of 15 points on the body were recorded at 30 fps for 13 pieces from "Franz Simandl / 30 Etudes for the double bass". Since this is an elementary study, the dataset is relatively small, with one male performer and about 30 minutes of data. We utilize a 2-layer LSTM (Long Short-Term Memory) network to convert from musical score to 3-D motion. The contribution of this work is to show the effect of combining bowing and fingering information with the musical score in the generation of performance motion, and to examine the effectiveness of the model structure for performance generation.
We conducted evaluation experiments from two perspectives. The first evaluates the geometric accuracy of the generated 3-D trajectories, and the second evaluates the naturalness of the generated 3-D motion as a performance. The results showed that the accuracy of generated motion for double bass can be improved by using two types of additional information (bowing and fingering) in addition to the musical score information.
-
Significance of the publication "Speech communication and people with disabilities"
Akira Ichikawa, Yuji Nagashima, Akira Okamoto, Naoto Kato, Shinji Sako, Daisuke Hara, Michiru Makuuchi
IEICE 115th Technical Committee on Well-being Information Technology (WIT) Institute of Electronics, Information and Communication Engineers
Event date: 2021.12
Language:Japanese Presentation type:Oral presentation (general)
Venue:Online meeting (Zoom)
The book we authored, "Speech Communication and People with Disabilities" (edited by the Acoustical Society of Japan, Acoustic Science Series 22, Corona Publishing), aims to clarify the function of communication by analyzing sign language and finger braille in a cross-sectional manner from the perspective of speech research. One of the features of this book is that it shows the characteristics that form the common basis of the origin of language, namely the language of dialogue. For this purpose, the book deals with auditory language (speech), visual language (sign language), and tactile language (finger braille and tactile sign language) in a cross-sectional manner. We introduce the outline and significance of the book for researchers in the field of well-being information technology.
-
Music Mood Recognition Based on Synchronized Audio and Lyrics International conference
Sho Ikeda, Shinji Sako
22nd International Society for Music Information Retrieval Conference International Society for Music Information Retrieval
Event date: 2021.11
Language:English Presentation type:Poster presentation
Venue:Online
The aim of our study is to improve the accuracy of music mood recognition using audio and lyrics. As a method, we build a dataset in which audio and lyrics are synchronized, and utilize both the lyrics and audio modalities for mood recognition. There is little research that deals with the synchronization of audio and lyrics in music mood recognition. Therefore, we build a dataset by extracting the parts of the lyrics that are sung in the audio. Using the dataset, we investigate the impact of lyric-audio synchronization on music mood recognition tasks. In our experiments, we extract word embedding representations from the lyrics as features and perform music mood recognition using a deep neural network. To verify the effectiveness of synchronizing audio and lyrics, we conduct experiments in terms of the number of words in the lyrics and the number of music clips.
-
Attribute-Aware Deep Music Transformation For Polyphonic Music International conference
Yuta Matsuoka, Shinji Sako
22nd International Society for Music Information Retrieval Conference International Society for Music Information Retrieval
Event date: 2021.11
Language:English Presentation type:Poster presentation
Venue:Online
Recent machine learning technologies have made it possible to automatically create a variety of new music, and many approaches have been proposed to control musical attributes such as the pitch and rhythm of the generated music. However, most of them focus only on monophonic music. In this study, we apply a deep music transformation model, which can control the musical attributes of monophonic music, to polyphonic music. We employ Performance Encoding, which can efficiently describe polyphonic music, as the input to the model. To evaluate the proposed method, we performed music transformation using a polyphonic music dataset.
-
Teppei Miura, Shinji Sako
IEICE 114th Technical Committee on Well-being Information Technology (WIT) Institute of Electronics, Information and Communication Engineers
Event date: 2021.10
Language:Japanese Presentation type:Oral presentation (general)
Venue:Online meeting (Zoom)
-
Dynamics Restoration for "Loud" Popular Music
Hyuga Ozeki, Shinji Sako
IPSJ 132th Special Interest Group on MUSic and computer (SIGMUS) Information Processing Society of Japan
Event date: 2021.09
Language:Japanese Presentation type:Oral presentation (general)
Venue:Online meeting
In the production of popular music, mastering engineers tend to excessively increase the volume level of songs. However, such loud songs with low dynamics are often unsuitable for recent listening styles. Therefore, in this study we attempt to restore their dynamics by estimating the short-term loudness before mastering from the spectrogram after mastering.
-
A study on multi-part beat tracking for mixed music signal with timing discrepancy
Kazuki Fukutani, Shinji Sako
IPSJ 131th Special Interest Group on MUSic and computer (SIGMUS) Information Processing Society of Japan
Event date: 2021.06
Language:Japanese Presentation type:Poster presentation
Venue:Online meeting
In this study, we attempted to simultaneously track the beat positions of multiple parts in a mixture of musical performances, using one beat label sequence for each instrument as well as multiple beat label sequences, and proposed a new method for such multi-part beat tracking. The effectiveness of the proposed method was confirmed by comparing it with a combined method that applies beat tracking to individual sounds separated by a source separation method.
-
Teppei Miura, Shinji Sako
IEICE 112th Technical Committee on Well-being Information Technology (WIT) Institute of Electronics, Information and Communication Engineers
Event date: 2021.06
Language:Japanese Presentation type:Oral presentation (general)
Venue:Online meeting (Zoom)
Sign language is an interactive visual language used by deaf people. However, most hearing people do not learn sign language and usually communicate through writing or interpreters. A portable sign language recognition and translation system is necessary for interactive and direct communication in daily use. We have been developing a mobile motion capture system that acquires the signer's body motion, with a view to applying it to sign language recognition and translation. The accuracy of the system's 2D/3D pose estimation deteriorates in real environments because of the lack of training data. To estimate poses with higher accuracy, we propose a technique using OpenPose, a high-quality 2D pose estimation tool.
Industrial Property Rights
-
Word determination system (単語決定システム)
青井基行,赤津 舞子,三浦 七瀬,酒向 慎司
Applicant:Union Software Management Co., Ltd.; Nagoya Institute of Technology
Application no:特願2018-048022 Date applied:2018.03
Country of applicant:Domestic Country of acquisition:Domestic
-
Drinking state determination device and drinking state determination method (飲酒状態判定装置及び飲酒状態判定方法)
岩田 英三郎, 酒向 慎司
Application no:PCT/JP2010/062776 Date applied:2010.07
Announcement no:特開2011-553634 Date announced:2012.06
Country of applicant:Domestic Country of acquisition:Domestic
The present invention enables drinking-state determination that does not require the use of specific words such as keywords. The drinking model has a tree structure whose classification criteria are based on the acoustic features of a drinker's speech; each node of this tree represents acoustic features of the drinker's phonemes. The non-drinking model has a tree structure whose classification criteria are based on the acoustic features of a non-drinker's speech; each node represents acoustic features of the non-drinker's phonemes. First, the subject's speech data is applied to the tree structures of both the drinking and non-drinking models, and the acoustic features of the phonemes are assigned to nodes. Next, the likelihood between the acoustic features of the subject's phonemes and the acoustic features specified at each node of each model is computed. Finally, using the computed likelihood values, it is determined whether the acoustic features of the speech are closer to the drinking model or to the non-drinking model.
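The decision rule described above amounts to comparing the likelihoods of the subject's acoustic features under the drinking and non-drinking models. The toy sketch below uses single Gaussians in place of the patent's tree-clustered phoneme acoustic models, and the feature dimension is an assumption.

# Toy sketch of the decision rule: score the subject's acoustic features under
# a "drinking" model and a "non-drinking" model and compare total
# log-likelihoods. Gaussian stand-ins replace the tree-clustered models.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
DIM = 13  # e.g. MFCC dimension (assumed)

drinking_model = multivariate_normal(mean=rng.normal(0.3, 0.1, DIM), cov=np.eye(DIM))
sober_model = multivariate_normal(mean=np.zeros(DIM), cov=np.eye(DIM))

def classify(features):
    """features: (n_frames, DIM) acoustic features of the subject's speech."""
    ll_drunk = drinking_model.logpdf(features).sum()
    ll_sober = sober_model.logpdf(features).sum()
    label = "drinking" if ll_drunk > ll_sober else "non-drinking"
    return label, ll_drunk - ll_sober

features = rng.normal(0.0, 1.0, size=(200, DIM))
print(classify(features))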
-
Speech synthesis method and apparatus (音声合成方法及び装置)
嵯峨山 茂樹, 槐 武也, 酒向 慎司, 松本 恭輔, 西本 卓也
Application no:特願2005-304082 Date applied:2005.10
Announcement no:特開2007-114355 Date announced:2007.05
Country of applicant:Domestic Country of acquisition:Domestic
[Problem] To provide high-quality synthetic speech together with a speech synthesis method that is easy to manipulate. [Solution] The spectral envelope of speech is approximated by a Gaussian mixture function, so that the speech spectrum is represented by a small number of parameters obtained as analysis parameters. Voiced sounds are then synthesized by using, as the basic waveform, a superposition of Gabor functions, which are the inverse Fourier transforms of the Gaussian mixture components, and placing it at every pitch period. Unvoiced sounds can also be synthesized by randomizing the pitch period.
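The synthesis idea can be illustrated numerically: a spectral envelope modeled as a Gaussian mixture corresponds in the time domain to a superposition of Gabor functions (Gaussian-windowed sinusoids), and placing that basic waveform at every pitch period yields a voiced waveform. All parameter values in the sketch below are arbitrary illustrative choices, not those of the patent.

# Numerical illustration (not the patented implementation): build one basic
# waveform as a sum of Gabor functions, then place it at every pitch period.
import numpy as np

FS = 16000                        # sampling rate [Hz]
F0 = 120.0                        # fundamental frequency [Hz]
mixtures = [                      # (weight, center frequency [Hz], bandwidth [Hz])
    (1.0, 500.0, 80.0),
    (0.6, 1500.0, 120.0),
    (0.3, 2500.0, 160.0),
]

def gabor_pulse(length=400):
    """One basic waveform: superposition of one Gabor function per mixture."""
    t = (np.arange(length) - length // 2) / FS
    pulse = np.zeros(length)
    for w, fc, bw in mixtures:
        # A Gaussian in frequency corresponds to a Gaussian envelope in time.
        sigma_t = 1.0 / (2.0 * np.pi * bw)
        pulse += w * np.exp(-0.5 * (t / sigma_t) ** 2) * np.cos(2.0 * np.pi * fc * t)
    return pulse

def synthesize_voiced(duration=0.5):
    n = int(duration * FS)
    pulse_starts = np.arange(0, n, int(FS / F0))   # one pulse per pitch period
    signal = np.zeros(n + 400)
    for t0 in pulse_starts:
        signal[t0:t0 + 400] += gabor_pulse()
    return signal[:n]

wave = synthesize_voiced()
print(wave.shape, float(np.abs(wave).max()))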
-
Speech recognition device and computer program (音声認識装置及びコンピュータプログラム)
山口 辰彦, 酒向 慎司, 山本 博史, 菊井 玄一郎
Applicant:Advanced Telecommunications Research Institute International
Application no:特願2003-317559 Date applied:2003.09
Announcement no:特開2005-84436 Date announced:2005.03
Country of applicant:Domestic Country of acquisition:Domestic
[Problem] To improve the final accuracy of speech recognition when recognition errors of one model are replaced with the recognition results of another model. [Solution] The speech recognition device includes: a speech recognition unit 40 that performs recognition using an N-gram model and outputs N-gram candidates 44 and a confidence measure; a preliminary discrimination unit 46 optimized to judge whether each N-gram candidate 44 from the recognition unit 40 is correct or incorrect; an example-candidate selection unit 50 that, for the portions judged incorrect by the preliminary discrimination unit 46, performs recognition using an example-sentence model and computes example-sentence candidates 52 and their confidence; and a final discrimination unit 54 that decides whether to replace the N-gram candidate 44 with the example-sentence candidate 52 and outputs the final speech recognition result 28. The preliminary discrimination unit 46 performs discrimination using a criterion biased to detect more errors than the criterion obtained by training.
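The pipeline described above can be summarized as a simple control flow (placeholders only, not the patented system): segments flagged by a deliberately error-biased preliminary discriminator are re-recognized with the example-sentence model, and a final decision chooses which hypothesis to keep.

# Control-flow sketch with toy stand-ins for the components described above.
def recognize(audio_segments, ngram_rec, example_rec, prelim_disc, final_disc):
    result = []
    for seg in audio_segments:
        ngram_hyp, ngram_conf = ngram_rec(seg)
        if prelim_disc(ngram_hyp, ngram_conf):            # biased: flags more errors
            ex_hyp, ex_conf = example_rec(seg)
            keep_example = final_disc(ngram_hyp, ngram_conf, ex_hyp, ex_conf)
            result.append(ex_hyp if keep_example else ngram_hyp)
        else:
            result.append(ngram_hyp)
    return result

# toy stand-ins for the recognizers and discriminators
hyps = recognize(
    ["seg1", "seg2"],
    ngram_rec=lambda s: (f"ngram({s})", 0.4),
    example_rec=lambda s: (f"example({s})", 0.9),
    prelim_disc=lambda h, c: c < 0.6,                     # low confidence -> suspect
    final_disc=lambda nh, nc, eh, ec: ec > nc,
)
print(hyps)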
Works
-
Kogakuin University Japanese Sign Language Multi-Dimensional Database (2nd Term)
Yuji Nagashima, Daisuke Hara, Yasuo Horiuchi, Shinji Sako
2022.10
Work type:Database science
The purpose of this dataset is to create a general-purpose sign language database that can be used in a variety of research fields. The dataset contains as much high-definition and high-accuracy data as possible for more than 6,000 sign words and a few dialogues selected by the project. The subjects were two native Japanese signers (one male and one female) from native sign language families, and the filming was conducted at the motion capture studio of Toei Tokyo Studio from 2017 to 2019. In addition to sign language video data (original MXF and mp4 formats) from 4K or full HD cameras installed in front and to the left/right, 3D motion data (BVH, C3D, and FBX formats) from optical motion capture and depth data from Kinect sensors (Kinect v2 xef format) are also included. In the second phase, 1,172 words and 7 dialogues are provided.
-
National Museum of Ethnology, Homō loquēns 'talking human' Wonders of Language and Languages
Yuji Nagashima, Daisuke Hara, Yasuo Horiuchi, Shinji Sako
2022.09 - 2022.11
Work type:Database science Location:National Museum of Ethnology
A technical exhibit introducing the high-precision sign language database KoSign was presented at the National Museum of Ethnology's special exhibition Homō loquēns "Talking Humans": Wonders of Language and Languages. The hand movements and facial expressions of sign language can be recorded precisely as digital data by using a motion capture system. The huge amount of recorded data, covering thousands of Japanese Sign Language signs used in daily life, enables the analysis of sign language and its expression by avatars.
-
Kogakuin University Japanese Sign Language Multi-Dimensional Database
Yuji Nagashima, Daisuke Hara, Yasuo Horiuchi, Shinji Sako
2021.06
Work type:Database science
The purpose of this dataset is to create a general-purpose sign language database that can be used in a variety of research fields. The dataset contains as much high-definition and high-accuracy data as possible for more than 6,000 sign words and a few dialogues selected by the project. The subjects were two native Japanese signers (one male and one female) from native sign language families, and the filming was conducted at the motion capture studio of Toei Tokyo Studio from 2017 to 2019. In addition to sign language video data (original MXF and mp4 formats) from 4K or full HD cameras installed in front and to the left/right, 3D motion data (BVH, C3D, and FBX formats) from optical motion capture and depth data from Kinect sensors (Kinect v2 xef format) are also included. The first phase of the project initially provides 3,701 words, 3 dialogues, and a dedicated analysis tool (Drawing and Annotation Support System). The total data size is approximately 3.6 TB.
-
Teppei Miura, Shinji Sako
2020.08
Work type:Database science
The dataset comprises 7 subjects, covering 16 sentences with 3-4 repetitions per subject.
The archived dataset size is 1.52 GB.
The dataset tree is structured as shown below:
NIT-3DHP-OMNI
+ A (personal ID for paper)
| + 011001001 (personal ID, sentence number, and take number, 3 digits each)
| | + input
| | | + 0000000001.jpg (RGB image)
| | | + 0000000002.jpg
| | | + ...
| | |
| | + target
| | + 0000000001.txt (3D joint positions)
| | + 0000000002.txt
| | + ...
| |
| + 011001002 ...
|
+ B ...
Each target text file holds 3D joint position data in the following order:
-------------------
Time Stamp
Head
Neck
Torso
Waist
Left Shoulder
Right Shoulder
Left Elbow
Right Elbow
Left Wrist
Right Wrist
Left Hand
Right Hand
-------------------
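For illustration, a small loader sketch for the directory layout above follows. How each target .txt encodes the listed values (a time stamp line followed by one line per joint) is an assumption here and should be adjusted to the actual file format.

# Illustrative loader sketch for the directory layout above (parsing details
# of the target files are assumptions).
import os
import glob

JOINT_NAMES = ["Head", "Neck", "Torso", "Waist",
               "Left Shoulder", "Right Shoulder", "Left Elbow", "Right Elbow",
               "Left Wrist", "Right Wrist", "Left Hand", "Right Hand"]

def iter_samples(root="NIT-3DHP-OMNI"):
    """Yield (image_path, target_path) pairs for every frame in the dataset."""
    for seq_dir in sorted(glob.glob(os.path.join(root, "*", "*"))):
        input_dir = os.path.join(seq_dir, "input")
        target_dir = os.path.join(seq_dir, "target")
        if not (os.path.isdir(input_dir) and os.path.isdir(target_dir)):
            continue
        for img in sorted(glob.glob(os.path.join(input_dir, "*.jpg"))):
            frame_id = os.path.splitext(os.path.basename(img))[0]
            yield img, os.path.join(target_dir, frame_id + ".txt")

def parse_target(path):
    """Assumed format: a time stamp line, then one 'x y z' line per joint."""
    with open(path) as f:
        lines = [line.split() for line in f if line.strip()]
    timestamp = lines[0]
    joints = {name: tuple(map(float, xyz)) for name, xyz in zip(JOINT_NAMES, lines[1:])}
    return timestamp, joints

for img_path, tgt_path in iter_samples():
    print(img_path, tgt_path)
    break
-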
Pressivo: An Automatic Accompaniment Generation System Considering Expressive Performance of the Melody
Kana Miyata, Shinji Sako, Tadashi Kitamura
2014.02
Work type:Software Location:Interaction 2014
-
A stochastic model of artistic deviation and its musical score for the elucidation of performance expression
K. Okumura, S. Sako, T. Kitamura
2013.08
Work type:Software Location:Stockholm, Sweden
http://smac2013.renconmusic.org/
-
Ryry: Automatic Accompaniment System Capable of Polyphonic Instruments
Ryuichi Yamamoto, Shinji Sako, Tadashi Kitamura
2013.03
Work type:Software
-
Music Impression Database
酒向慎司,岩月靖典,西尾圭一郎,北村正
2013.03
Work type:Software
-
Automatic Composition System Orpheus
Shigeki Sagayama et al.
2013.01
Work type:Software
-
Open JTalk version 1.05
2011.12
Work type:Software
Other research activities
-
Multi-Modal Speech Database for Research: M2TINIT
2003.03
The research-oriented multimodal speech database M2TINIT (Multi-Modal Speech Database by Tokyo Institute of Technology and Nagoya Institute of Technology) is a database of simultaneously recorded speech and lip images, developed and released by the Kobayashi Laboratory of the Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, and the Kitamura-Tokuda Laboratory of the Department of Intelligent Information Systems, Nagoya Institute of Technology, to promote multimodal speech research. It has been used in research on speech and lip-image generation and on bimodal speech recognition.
Awards
-
Best Presentation Award, The Tokai Chapter of Acoustical Society of Japan
2021.12 The Tokai Chapter of Acoustical Society of Japan Dynamics Restoration for "Loud" Popular Music
Hyuga Ozeki
Award type:Award from Japanese society, conference, symposium, etc. Country:Japan
-
Student Encouraging Award
2021.09 Information Processing Society of Japan Dynamics Restoration for "Loud" Popular Music
Hyuga Ozeki, Shinji Sako
Award type:Award from Japanese society, conference, symposium, etc. Country:Japan
-
Japan Society for Fuzzy Theory and Intelligent Informatics Best Paper Award
2017.09 Japan Society for Fuzzy Theory and Intelligent Informatics Automatic Performance Rendering Method for Keyboard Instruments based on Statistical Model that Associates Performance Expression and Musical Notation
Kenta Okumura, Shinji Sako, Tadashi Kitamura
Award type:Honored in official journal of a scientific society, scientific journal Country:Japan
This paper proposes a method for the automatic rendition of performances without losing any characteristics of the specific performer. In many of the existing methods, users are required to input expertise such as that possessed by the performer. Although such methods are useful in support of users' own performances, they are not suitable for the purpose of this proposal. The proposed method defines a model that associates the feature quantities of expression extracted from cases of actual performance with directions that can be reliably retrieved from the musical score without using expertise. By classifying the expressive tendency of the model for each performance case using criteria based on score directions, rules that systematically elucidate the causal relationship between the performer's specific performance expression and the score directions can be structured. Candidates of the performance cases corresponding to unseen score directions are obtained by tracing this structure. Dynamic programming is applied to solve the problem of searching for the sequence of performance cases with the optimal expression from among these candidates. Objective evaluations indicated that the proposed method is able to efficiently render optimal performances. Subjective evaluations confirmed the quality of the expression rendered by the proposed method. It was also shown that the characteristics of the performer could be reproduced even in various compositions. Furthermore, performances rendered via the proposed method won first prize in the autonomous section of a performance rendering contest for computer systems.
-
IPSJ Yamashita SIG Research Award
2016.03 Information Processing Society of Japan A study of comparative analysis of music performances based on the statistical model that associates expression and notation
Kenta Okumura, Shinji Sako, Tadashi Kitamura
Award type:Award from Japanese society, conference, symposium, etc. Country:Japan
-
78th National Convention of IPSJ, Student Encouragement Award
2016.03 Information Processing Society of Japan A case-based approach to the melody transformation for automatic jazz arrangement
Naoto Sato, Shinji Sako, Tadashi Kitamura
Award type:Award from Japanese society, conference, symposium, etc. Country:Japan
-
Award for Contribution to Society Activities
2014.03 The Tokai Chapter of Acoustical Society of Japan
Shinji Sako
Country:Japan
-
76th National Convention of IPSJ, Student Encouragement Award
2014.03 Information Processing Society of Japan Automatic Accompaniment Generation Reflecting Musical Expression of Melody
Kana Miyata, Shinji Sako, Tadashi Kitamura
Award type:Award from Japanese society, conference, symposium, etc. Country:Japan
-
76th National Convention of IPSJ, Student Encouragement Award
2014.03 Information Processing Society of Japan Music retrieval system for any words using impression space: Improvement of mapping words and reconstruction of evaluation
Ai Zukawa, Shinji Sako, Tadashi Kitamura
Award type:Award from Japanese society, conference, symposium, etc. Country:Japan
-
Acoustical Society of Japan, Tokai Branch, Best Presentation Award
2013.09 Acoustical Society of Japan Audio to Score Alignment using Semi-Markov Conditional Random Fields
Ryuichi Yamamoto, Shinji Sako, Tadashi Kitamura
Award type:Award from Japanese society, conference, symposium, etc. Country:Japan
-
Forum on Information Technology Encouragement Award 2013
2013.09 Information Processing Society of Japan Statistical Modeling and Parameter Learning for Violin Fingering Estimation Corresponding to Skill Level
Wakana Nagata, Shinji Sako, Tadashi Kitamura
Award type:Award from Japanese society, conference, symposium, etc. Country:Japan
Scientific Research Funds Acquisition Results
-
Construction of a technological foundation for supporting and recording sign language dialogue using first-person view video
Grant number:23K11197 2023.04 - 2026.03
Japan Society for the Promotion of Science, Grant-in-Aid for Scientific Research (C)
Shinji Sako
Authorship:Principal investigator Grant type:Competitive
-
Development of a sign language corpus and a system for semi-automatically generating labeled sign language data for deep learning
Grant number:22H00661 2022.04 - 2026.03
Japan Society for the Promotion of Science, Grant-in-Aid for Scientific Research (B)
Tsutomu Kimura
Authorship:Coinvestigator(s) Grant type:Competitive
We are developing a deep-learning-based sign language translation system, which requires sign language recognition and semantic analysis. These in turn require the construction of sign language corpora and large amounts of labeled data for supervised deep learning, but labeling is laborious. In this research, we therefore develop and release a system that semi-automatically labels unlabeled sign language videos. The labeled sign language datasets created with this system will be provided to sign language linguists and sign language engineering researchers to support research on the semantic analysis and recognition of sign language.
-
Basic research on the creation and use of sound information generated by visually impaired people actively tapping with a white cane
Grant number:18K18698 2018.04 - 2022.03
Japan Society for the Promotion of Science, Grant-in-Aid for Challenging Exploratory Research
布川 清彦
Authorship:Coinvestigator(s) Grant type:Competitive
-
Research on the construction of a multipurpose Japanese Sign Language database International coauthorship
Grant number:17H06114 2017.07 - 2021.03
Grant-in-Aid for Scientific Research (S)
Yuji Nagashima
Authorship:Coinvestigator(s) Grant type:Competitive
Grant amount:\141960000 ( Direct Cost: \109200000 、 Indirect Cost:\32760000 )
In this research, with two native signers (one male and one female) as language informants, we constructed KoSign, a sign language word database of 6,359 words with high-precision, high-definition 3D motion, video, and depth data. In addition, we recorded the world's first high-definition, high-precision 3D motion and video data of sign language dialogues. To make the recorded dialogues more useful, we annotated them with word and facial-expression extraction, translations, and so on, and we also built a tool (MAT) to support annotation. To promote sign language research, the first batch of KoSign and MAT, consisting of 3,701 vocabulary items and 3 annotated dialogues, was released through NII IDR on May 25, 2021.
-
Research on performance generation that transfers a performer's individuality and on cooperative performance systems International coauthorship
Grant number:15KK0008 2016.04 - 2019.03
Japan Society for the Promotion of Science, Fund for the Promotion of Joint International Research
Shinji Sako
Authorship:Principal investigator Grant type:Competitive
Grant amount:\12090000 ( Direct Cost: \9300000 、 Indirect Cost:\2790000 )
We worked on advancing score-following technology based on acoustic signals, and developed a new performance-tracking method that exploits score information such as percussion and melody in addition to the basic note sequence. Simulation experiments using the RWC Music Database showed that score-following accuracy can be improved without sacrificing real-time performance. As an image processing method for capturing changes in finger shape during performance, we constructed multimodal data of performances; for a convolutional-neural-network-based hand shape recognition method, we expanded the training dataset by generating a large number of pseudo images with a precise 3D hand model and confirmed that recognition accuracy for real images improved significantly.
Past of Commissioned Research
-
Research and development on building an AI-based automatic inspection system for the textile industry
2022.10 - 2025.03
Aichi Prefecture, Knowledge Hub Aichi Priority Research Project (Project DX) General Consignment Study
Authorship:Coinvestigator(s) Grant type:Collaborative (industry/university)
This project aims to automate the textile industry by automating the fabric inspection process using image processing and by automating anomaly detection in looms using acoustic processing technology. In all manufacturing industries, including the textile industry, the inspection process that checks products is important for guaranteeing product reliability. However, inspection in the textile industry is performed almost entirely by visual checks by skilled workers, which hinders efficiency gains through automation. Maintenance of manufacturing machinery is likewise essential for improving product reliability, and fault detection also relies largely on human experience. In this project, we therefore aim to automate inspection by analyzing images of textiles with image processing technology, and to establish a method for detecting loom anomalies by analyzing the sounds emitted by looms with acoustic processing technology. In this way, the project aims to automate inspection processes in the textile industry by using AI-based image and acoustic processing technologies.
-
Technology development of high-precision motion detection and motion pattern matching to realize automatic sign language translation
2016.10 - 2019.03
Ministry of Economy, Trade and Industry, Strategic Core Technology Advancement Program (Sapoin) General Consignment Study
青井 基行
Authorship:Coinvestigator(s) Grant type:Competitive
-
2015.01 - 2015.12
Japan Science and Technology Agency, Adaptable and Seamless Technology Transfer Program through Target-driven R&D (A-STEP), FS Stage General Consignment Study
Shinji Sako
Authorship:Principal investigator Grant type:Competitive
Grant amount:\2210000 ( Direct Cost: \1700000 、 Indirect Cost:\510000 )
In this research, we worked on improving the accuracy of score following, a key component technology of automatic accompaniment systems, and on developing a robot that synchronizes with a human performance by applying score-following technology. For score following, we proposed a new tracking model that exploits score information, distinguishing percussion sounds, from which tempo fluctuations are easy to grasp, from other instrument types, and confirmed improved tracking accuracy. For the robot that follows a performance, we developed, jointly with an industrial robot manufacturer, a system that follows performance information including tempo fluctuations in real time and controls the robot, and we exhibited and demonstrated it at the International Robot Exhibition.
-
Development of an automatic accompaniment rehabilitation support system that flexibly accommodates diverse usage patterns
2013.08 - 2014.03
Japan Science and Technology Agency, Adaptable and Seamless Technology Transfer Program through Target-driven R&D (A-STEP), FS Stage General Consignment Study
Shinji Sako
Authorship:Principal investigator Grant type:Competitive
Grant amount:\2210000 ( Direct Cost: \1700000 、 Indirect Cost:\510000 )
Playing a musical instrument is not only an enjoyable hobby; because it involves complex body movements, it can also be expected to serve as rehabilitation of physical and brain functions. An important point in rehabilitation support through instrument performance is that the required degree of support differs from person to person, so the system must respond flexibly to each user's needs and constraints. With a view to building a rehabilitation support system based on instrument performance that does not depend on the user, we examined an automatic adaptation method for spectral templates that is robust to differences between instruments, improvements to tempo estimation accuracy, and the effect of tempo estimation errors in actual performances. We also investigated the relationship between computational cost and performance, and improved the algorithm for real-time processing.
-
Research on a statistical-model-based model for estimating affective impressions from music that can adapt to variations in user preferences and listening situations
2011.08 - 2012.03
Japan Science and Technology Agency, Adaptable and Seamless Technology Transfer Program through Target-driven R&D (A-STEP), FS Stage General Consignment Study
Shinji Sako
Authorship:Principal investigator Grant type:Competitive
Grant amount:\2210000 ( Direct Cost: \1700000 、 Indirect Cost:\510000 )
For an impression estimation system that estimates the impression received from music directly from the digital music data, we developed a new method that uses a profile consisting of attributes such as gender and musical experience in order to cope with differences in individual preferences and sensibilities. A feature of this method is that it does not require collecting, in advance, the impression data obtained when the target user listens to music in order to train an impression estimation model; instead, a model suited to (similar to) a specific user can be automatically selected from other users' impression estimation models based on the profile information. In addition, in order to efficiently collect music-listening impression data in a short period, we built a music presentation and impression data collection system using a web browser and collected large-scale impression evaluation data from 120 people of various ages.
Committee Memberships
-
The Institute of Electronics, Information and Communication Engineers, Human Communication Symposium 2022, Steering Committee Member
2022.10 - 2022.12
Committee type:Academic society
-
Information Processing Society of Japan IPSJ Special Interest Group on Music and Computer
2022.05
Committee type:Academic society
-
The Institute of Electronics, Information and Communication Engineers, 21st Forum on Information Technology (FIT), SIG Liaison Member and Program Committee Member
2022.01 - 2022.09
Committee type:Academic society
-
The Institute of Electronics, Information and Communication Engineers Language as Real-time Communication, Chairman
2021.04
Committee type:Academic society
-
The Institute of Electronics, Information and Communication Engineers IEICE Well-being Information Technology (WIT), Chairman
2021.04
Committee type:Academic society
-
The Institute of Electronics, Information and Communication Engineers IEICE Well-being Information Technology (WIT), Vice Chairman
2019.04 - 2021.03
Committee type:Academic society
-
Information Processing Society of Japan Editorial Board Member
2019.04 - 2020.03
Committee type:Academic society
-
The Institute of Electronics, Information and Communication Engineers, 18th Forum on Information Technology (FIT), SIG Liaison Member and Program Committee Member
2018.12 - 2019.09
Committee type:Academic society
-
The Institute of Electronics, Information and Communication Engineers, Human Communication Symposium 2018, Program Committee Chair
2018.07 - 2019.01
Committee type:Academic society
-
Information Processing Society of Japan IPSJ Special Interest Group on Music and Computer
2018.04 - 2022.05
Committee type:Academic society
Social Activities
-
Development of anomaly detection and prediction technology for operating sounds at production sites
Role(s): Lecturer
Owari Textile Research Center, Online (Zoom) 2022.10
Audience: Researchers
Type:Visiting lecture