SAKO Shinji


Department of Computer Science
Center for Research on Assistive Technology for Building a New Community


Associate Professor

Sako Shinji

  • Ph.D. (Engineering) ( 2004.03   Nagoya Institute of Technology )

Research Interests

  • Music Signal Processing

  • Music Information Processing

  • Sign Language Recognition

  • Singing Voice Synthesis

  • Speech Synthesis

Research Areas

  • Life Science / Rehabilitation science

  • Informatics / Kansei informatics

  • Informatics / Perceptual information processing

From School

  • Nagoya Institute of Technology   Faculty of Engineering   Department of Intelligent Information Systems   Graduated

    1995.04 - 1999.03

From Graduate School

  • Nagoya Institute of Technology   Graduate School, Division of Engineering   Department of Electrical & Computer Engineering   Doctor's Course   Completed

    2001.04 - 2004.03

External Career

  • Advanced Telecommunications Research Institute International

    2003.04 - 2003.06

  • The University of Tokyo   Graduate School of Information Science and Technology   Research Assistant

    2004.04 - 2007.03

  • AGH University of Science and Technology   Faculty of Computer Science, Electronics and Telecommunications   Guest Scientist

    2014.07 - 2014.08

  • Technical University of Munich   Institute for Human-Machine Communication   Guest Scientist

    2012.06 - 2012.12

  • Technical University of Munich   Institute for Human-Machine Communication   JSPS Scientist for Joint International Research

    2016.07 - 2017.03

Professional Memberships

  • Japanese Association of Sign Linguistics


  • Human Interface Society


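  • Executive Committee, Tokai-Section Joint Conference on Electrical, Electronics, Information, and Related Engineering (電気関係学会東海支部連合大会実行委員会)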

    2009.04 - 2009.12

  • Advanced Language Information Forum (ALAGIN, 高度言語情報融合フォーラム)


  • The Institute of Image Information and Television Engineers


Qualification Acquired

  • Software Design & Development Engineer/Information Processing Engineer, Class 1



  • Dynamic Hand Gesture Recognition for Human-Robot Collaborative Assembly Reviewed International journal

    Bogdan Kwolek, Shinji Sako

    ICAISC 2023: Artificial Intelligence and Soft Computing, Lecture Notes in Computer Science   14125   112 - 121   2023.06

    Authorship:Last author   Language:English   Publishing type:Research paper (international conference proceedings)  

    In this work, we propose a novel framework for gesture recognition for human-robot collaborative assembly. It permits recognition of dynamic hand gestures and their duration to automate planning the assembly or common human-robot workspaces according to Methods-Time-Measurement recommendations. In the proposed approach the common workspace of a worker and Franka-Emika robot is observed by an overhead RGB camera. A spatio-temporal graph convolutional neural network operating on 3D hand joints extracted by MediaPipe is used to recognize hand motions in manual assembly tasks. It predicts five motion sequences: grasp, move, position, release, and reach. We present experimental results of gesture recognition achieved by a spatio-temporal graph convolutional neural network on real RGB image sequences.

    DOI: 10.1007/978-3-031-42505-9_10

  • 3D Ego-Pose Lift-Up Robustness Study for Fisheye Camera Perturbations Reviewed International journal

    Teppei Miura, Shinji Sako, Tsutomu Kimura

    Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications   4   600 - 606   2023.02

    Language:English   Publishing type:Research paper (international conference proceedings)  

    3D egocentric human pose estimation from a mounted fisheye camera has been developed following advances in convolutional neural networks and synthetic data generation. The camera captures different images that are affected by the optical properties, the mounted position, and the camera perturbations caused by body motion. Therefore, data collection and model training are the main challenges in estimating 3D ego-pose from a mounted fisheye camera. Past works proposed synthetic data generation and a two-step estimation model consisting of 2D human pose estimation and subsequent 3D lift-up to overcome these challenges. However, these works did not sufficiently verify robustness to camera perturbations. In this paper, we evaluate existing models for robustness using a synthetic dataset with camera perturbations that increase in several steps. Our study provides useful knowledge for introducing 3D ego-pose estimation for a mounted fisheye camera in practice.

    DOI: 10.5220/0011661000003417

  • Visualization of Affective Information in Music Using Chironomie Reviewed International journal


    Authorship:Last author   Language:English   Publishing type:Research paper (international conference proceedings)  

  • Simple yet effective 3D ego-pose lift-up based on vector and distance for a mounted omnidirectional camera Reviewed International journal

    Teppei Miura, Shinji Sako

    Applied Intelligence   2022.05

    Language:English   Publishing type:Research paper (scientific journal)   Publisher:Springer  

    Following the advances in convolutional neural networks and synthetic data generation, 3D egocentric body pose estimations from a mounted fisheye camera have been developed. Previous works estimated 3D joint positions from raw image pixels and intermediate supervision during the process. The mounted fisheye camera captures notably different images that are affected by the optical properties of the lens, angle of views, and setup positions. Therefore, 3D ego-pose estimation from a mounted fisheye camera must be trained for each set of camera optics and setup. We propose a 3D ego-pose estimation from a single mounted omnidirectional camera that captures the entire circumference by back-to-back dual fisheye cameras. The omnidirectional camera can capture the user’s body in the 360° field of view under a wide variety of motions. We also propose a simple feed-forward network model to estimate 3D joint positions from 2D joint locations. The lift-up model can be used in real time yet obtains accuracy comparable to those of previous works on our new dataset. Moreover, our model is trainable with the ground truth 3D joint positions and the unit vectors toward the 3D joint positions, which are easily generated from existing publicly available 3D mocap datasets. This advantage alleviates the data collection and training burden due to changes in the camera optics and setups, although it is limited to the effect after the 2D joint location estimation.

    DOI: 10.1007/s10489-022-03417-3

  • 3D skeleton motion generation of double bass from musical score Reviewed International journal

    Takeru Shirai, Shinji Sako

    15th International Symposium on Computer Music Multidisciplinary Research (CMMR)   41 - 46   2021.11

    Language:English   Publishing type:Research paper (international conference proceedings)  

    In this study, we propose a method for generating 3D skeleton motions of a double bass player from musical score information using a 2-layer LSTM network. Since there is no suitable dataset for this study, we have created a new motion dataset with actual double bass performance. The contribution of this paper is to show the effect of combining bowing and fingering information in the generation of performance motion, and to examine the effective model structure in performance generation. Both objective and subjective evaluations showed that the accuracy of generating performance motion for double bass can be improved using two types of additional information (bowing, fingering information) and improved by constructing a model that takes into account bowing and fingering.

  • SynSLaG: Synthetic Sign Language Generator Reviewed International journal

    Teppei Miura, Shinji Sako

    ASSETS '21: The 23rd International ACM SIGACCESS Conference on Computers and Accessibility   ( 90 )   1 - 4   2021.10

    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:Association for Computing Machinery  

    Machine learning techniques have the potential to play an important role in sign language recognition. However, sign language datasets lack the volume and variety necessary to work well. To enlarge these datasets, we introduce SynSLaG, a tool that synthetically generates sign language datasets from 3D motion capture data. SynSLaG generates realistic images of various body shapes with ground truth 2D/3D poses, depth maps, body-part segmentations, optical flows, and surface normals. The large synthetic datasets provide possibilities for advancing sign language recognition and analysis.

    DOI: 10.1145/3441852.3476519

  • Recognition of JSL fingerspelling using Deep Convolutional Neural Networks Reviewed International journal

    Bogdan Kwolek, Wojciech Baczynski, Shinji Sako

    Neurocomputing   2021.06

    Language:English   Publishing type:Research paper (scientific journal)  

    In this paper, we present an approach for recognition of static fingerspelling in Japanese Sign Language on RGB images. Two 3D articulated hand models have been developed to generate synthetic fingerspellings and to extend a dataset consisting of real hand gestures. In the first approach, advanced graphics techniques were employed to rasterize photorealistic gestures using a skinned hand model. In the second approach, gestures rendered using simpler lighting techniques were post-processed by a modified Generative Adversarial Network. In order to avoid generation of unrealistic fingerspellings, a hand segmentation term has been added to the loss function of the GAN. The segmentation of the hand in images with complex backgrounds was done by the proposed ResNet34-based segmentation network. The finger-spelled signs were recognized by an ensemble of neural networks, some fine-tuned and some trained from scratch. Experimental results demonstrate that, owing to a sufficient amount of training data, a high recognition rate can be attained on RGB images. The JSL dataset with pixel-level hand segmentations is available for download.

    DOI: 10.1016/j.neucom.2021.03.133

  • Fingerspelling recognition using synthetic images and deep transfer learning Reviewed

    Nguyen Tu Nam, Shinji Sako, Bogdan Kwolek

    2020 The 13th International Conference on Machine Vision (ICMV 2020)   2020.11

    Language:English   Publishing type:Research paper (international conference proceedings)  

    Although gesture recognition has been intensely studied for decades, it is still a challenging research topic due to difficulties posed by background complexity, occlusion, viewpoint, lighting changes, the deformable and articulated nature of hands, etc. Numerous studies have shown that extending a training dataset of real images with synthetic images improves the recognition accuracy. However, little work has been devoted to demonstrating what improvements in recognition can be achieved by transferring the style of real gestures onto synthetically generated images. In this paper, we propose a novel method for Japanese fingerspelling recognition using both real and synthetic images generated on the basis of a 3D hand model. We propose to employ neural style transfer to bring information from real images into the synthetically generated dataset. We demonstrate experimentally that neural style transfer and discriminative layer training applied to training deep neural models yield considerable gains in recognition accuracy.

  • Study on Effective Combination of Features for Non-word Speech Recognition of Phonological Examination Reviewed

    Toshiharu Tadano,Masahiko Nawate,Fumihito Ito,Shinji Sako

    IPSJ Journal   61 ( 10 )   1647 - 1657   2020.10

    Language:Japanese   Publishing type:Research paper (scientific journal)   Publisher:Information Processing Society of Japan  

    Developmental dyslexia is a main type of learning disability, and its early detection is very important for intervention and reading treatment. A convenient PC-based screening test has been published, in which the answer times for text reading, reversed reading of words, and mora skipping of words are recorded automatically. However, the correctness of each answer must be judged by the tester. Automating these tests requires speech recognition that can handle the non-words (meaningless words) used in the examination, but the recognition accuracy of conventional speech recognition for non-words is low. Therefore, the function of conventional speech recognition needs to be reinforced and the accuracy for non-words improved to a level that is practical for phonological examination. In this study, we tried to improve the accuracy for non-words by incorporating a mechanism for determining non-word correctness into Julius, which is in the public domain and can be modified freely. In addition, six candidate speech features are considered, and the trend in accuracy across their combinations is examined. As a result, depending on the target non-word, the accuracy was 75.0% to 95.0%, and the overall average was 87.5%.

  • 3D human pose estimation model using location-maps for distorted and disconnected images by a wearable omnidirectional camera Reviewed International journal

    Teppei Miura, Shinji Sako

    IPSJ Transactions on Computer Vision and Applications   12 ( 4 )   1 - 17   2020.08

    We address 3D human pose estimation for equirectangular images taken by a wearable omnidirectional camera. The equirectangular image is distorted because the omnidirectional camera is attached closely in front of a person’s neck. Furthermore, some parts of the body are disconnected on the image; for instance, when a hand goes out to an edge of the image, the hand comes in from another edge. The distortion and disconnection of images make 3D pose estimation challenging. To overcome this difficulty, we introduce the location-maps method proposed by Mehta et al.; however, the method was used to estimate 3D human poses only for regular images without distortion and disconnection. We focus on a characteristic of the location-maps that can extend 2D joint locations to 3D positions with respect to 2D-3D consistency without considering kinematic model restrictions and optical properties. In addition, we collect a new dataset that is composed of equirectangular images and synchronized 3D joint positions for training and evaluation. We validate the location-maps’ capability to estimate 3D human poses for distorted and disconnected images. We propose a new location-maps-based model by replacing the backbone network with a state-of-the-art 2D human pose estimation model (HRNet). Our model is a simpler architecture than the reference model proposed by Mehta et al. Nevertheless, our model indicates better performance with respect to accuracy and computation complexity. Finally, we analyze the location-maps method from two perspectives: the map variance and the map scale. As a result, two characteristics of location-maps are revealed: (1) the map variance affects the robustness to 2D estimation error when extending 2D joint locations to 3D positions, and (2) the 3D position accuracy is related to the accuracy of the 2D locations relative to the map scale.

    DOI: 10.1186/s41074-020-00066-8

Books and Other Publications


  • Envisioning the Future: A Human Communication Research Perspective Invited

    Sumaru Niida, Tomoyasu Komori, Shinji Sako, Akihiro Tanaka, Kiyohiko Nunokawa

    The Journal of the Institute of Electronics, Information and Communication Engineers   107 ( 3 )   237 - 243   2024.03

    Authorship:Last author   Language:Japanese   Publishing type:Article, review, commentary, editorial, etc. (scientific journal)   Publisher:Institute of Electronics, Information and Communication Engineers  

  • Accessibility Guidelines for Papers and Presentations towards Realizing Inclusive Society Invited

    Kiyohiko Nunokawa, Daisuke Wakatsuki, Shinji Sako

    The Journal of the Institute of Electronics, Information and Communication Engineers   106 ( 12 )   1108 - 1114   2023.12

    In FY2023, the Accessibility Guidelines for Writing and Presenting Papers was revised to ver. 4.0. This paper introduces the background of the revision and explains the relationship between the Guidelines and the social background of the revision, namely, the Law on Elimination of Discrimination against Persons with Disabilities, which makes it mandatory to provide reasonable accommodation for persons with disabilities when they participate in academic societies and research groups.

  • ICF and Accessibility Guidelines for Papers and Presentations Invited

    Kiyohiko Nunokawa, Daisuke Wakatsuki, Shinji Sako

    The Journal of the Institute of Electronics, Information and Communication Engineers   106 ( 12 )   1115 - 1119   2023.12

    The Accessibility Guidelines for Writing and Presenting Papers was revised to ver. 4.0 in FY2023. This paper introduces the International Classification of Functioning, Disability and Health (ICF), a world-standard view of disability that considers disability as a negative aspect of functioning, and explains its relationship to the guidelines.

  • HMM-based Automatic Sign Language Recognition using Phonemic Structure of Japanese Sign Language

    Shinji Sako, Tadashi Kitamura

    Journal of the Japan Society for Welfare Engineering   17 ( 2 )   2 - 7   2015.11

    Authorship:Lead author   Language:Japanese   Publishing type:Article, review, commentary, editorial, etc. (international conference proceedings)   Publisher:Japan Society for Welfare Engineering  

  • Speech/Sound based Human Interfaces (1) Construction of Speech Synthesis Systems using HTS Reviewed

    Keiichiro Oura, Heiga Zen, Shinji Sako, Keiichi Tokuda

    Human interface   12 ( 1 )   35 - 40   2010.02

  • Special Feature: Music and OR: Automatic Composition from Japanese Lyrics Invited

    嵯峨山 茂樹,中妻 啓,深山 覚,酒向 慎司,西本 卓也

    Operations research as a management science   54 ( 9 )   546 - 553   2009.10


  • Review Text Generation for Music Using Acoustic Information and Lyrics

    川地 奎多, 酒向 慎司

    IPSJ 141st Special Interest Group on MUSic and computer (SIGMUS)  2024.08  Information Processing Society of Japan

    Event date: 2024.08

    Language:Japanese   Presentation type:Oral presentation (general)  

    Venue:Komazawa University, Komazawa Campus  

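    In recent years, the spread of music streaming services has greatly improved access to music. On the other hand, we feel that listening has become increasingly passive, with music consumed as background music, and that opportunities for deep appreciation are decreasing. In this study, we therefore considered that describing music in words is one way to help listeners understand music and to increase the value of the musical experience. The verbalization of music has been actively studied in recent years as the music captioning task (the task of describing information about music in natural-language text). Previous studies focused on generating descriptions of music using acoustic information only. In this study, we therefore attend to lyrics in addition to acoustic information and attempt to generate review text for music. Specifically, we use MU-LLaMA, which generates music descriptions using a music feature extractor and a large language model (LLM), as the baseline model, and realize review generation that also takes lyrics into account by designing a system prompt that gives instructions to LLaMA in advance. Furthermore, through three evaluation experiments, we confirmed that the proposed method is more effective than the conventional method in terms of diversity of expression and in forming an image of the music.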

  • A study of mouthing detection in Japanese Sign Language video for annotation support

    Kana Tatsumi, Shinji Sako

    IEICE 125th Technical Committee on Well-being Information Technology (WIT)  2024.08  Institute of Electronics, Information and Communication Engineers

    Event date: 2024.08

    Language:Japanese   Presentation type:Oral presentation (general)  

    Venue:Future University Hakodate  

    We studied the detection of mouthing, a type of mouth movement in Japanese Sign Language, to support sign language annotation. There is a need to develop a larger and more versatile corpus of Japanese Sign Language. However, annotation is difficult due to the complexity of sign language expressions. Therefore, we would like to automate the annotation of signs so that the corpus can be developed more efficiently. In the development of BOBSL, a technology that recognizes mouth shapes was used for automatic annotation. In this study, we use existing machine lip-reading technology to recognize mouth shapes in Japanese Sign Language. We then detect mouth movements, called mouthing, by matching the recognition results with mouthing words identified from the translated text. We also collected videos of Japanese Sign Language to create our dataset. In doing so, we investigated the expression patterns of mouthing and found variations. Finally, we considered improvements to the proposed method for future research.

  • Estimation of lighting color, brightness, and movement based on music audio signals to support stage lighting and its evaluation

    Nano GATTO, Shinji Sako

    IPSJ 140th Special Interest Group on MUSic and computer (SIGMUS)  2024.05  Information Processing Society of Japan

    Event date: 2024.05

    Language:Japanese   Presentation type:Oral presentation (general)  

    Venue:Nihon University, College of Humanities and Sciences  

  • Exploring Individuality in Sign Language using Japanese Sign Language Video Data

    Zixuan DAI, Shinji SAKO

    Technical Committee on Human Communication Science  2024.03  Institute of Electronics, Information and Communication Engineers

    Event date: 2024.05

    Language:English   Presentation type:Oral presentation (general)  

    Venue:Okinawa Industry Support Center  

    In recent years, there have been growing expectations for technology to automatically recognize signs and generate sign language CG. It is believed that individuality is expressed in sign languages as in spoken languages, but not many studies have been conducted on the individuality of sign languages. In this study, we examined whether the individuality of sign movement can be analyzed for video data of Japanese Sign Language, referring to previous studies on the analysis of individuality and anonymity of signs in motion-capture data of French Sign Language.

  • Accessibility Guidelines for Papers and Presentations towards Realizing Inclusive Society -Advancing research that will help realize a cohesive society-

    Kiyohiko Nunokawa, Daisuke Wakatsuki, Shinji Sako

    IEICE 124th Technical Committee on Well-being Information Technology (WIT)  2024.03  Institute of Electronics, Information and Communication Engineers

    Event date: 2024.03

    Language:Japanese   Presentation type:Oral presentation (invited, special)  

    Venue:Tsukuba University of Technology  

    Around the world, there is an accelerating movement to promote human evolution through diversity, including people with disabilities. In Japan, too, legislation is being developed to realize a symbiotic society. The Human Communication Group (HCG) has developed and has been revising the "Accessibility Guidelines for Writing and Presenting Research Papers"[1] to enable people of all backgrounds to participate in research. This presentation will outline the background of the revision of the "Accessibility Guidelines for Writing and Presenting Research Papers Ver. 4" (hereinafter referred to as the "Guidelines") in FY2023, from the publication of the Guidelines to Ver. 4, and the Japanese law on the rights of persons with disabilities, the Act on Elimination of Discrimination against Persons with Disabilities [2]. The relationship between the Guidelines and the Law will then be explained.

  • ICF and Accessibility Guidelines for Papers and Presentations

    Kiyohiko Nunokawa, Daisuke Wakatsuki, Shinji Sako

    IEICE 124th Technical Committee on Well-being Information Technology (WIT)  2024.03  Institute of Electronics, Information and Communication Engineers

    Event date: 2024.03

    Language:Japanese   Presentation type:Oral presentation (invited, special)  

    Venue:Tsukuba University of Technology  

    In this presentation, we will provide an overview of the International Classification of Functioning, Disability, and Health (ICF), the global standard view of disability. We will then explain the relationship between the ICF and the Accessibility Guidelines for Papers and Presentations Ver. 4[1], which were revised in FY2023. The ICF defines disability as a negative aspect of functioning. It includes not only physical and mental characteristics such as not being able to see or hear, but also situations where a mismatch between a person's characteristics and the environment causes difficulties in daily life. Therefore, by creating an environment that matches a person's characteristics, it is possible to eliminate difficulties and improve function, in other words, to reduce disability and increase what persons can do in their research life. The Guidelines provide specific examples of environmental adjustments tailored to the characteristics of researchers with disabilities that are effective when researchers with disabilities work with others to advance their research. It is expected that the use of the Guidelines will reduce the disabilities of researchers in their research activities and improve their functioning.

  • Estimation of lighting color, brightness, and movement based on the structure and mood of the music to support lighting direction

    Nano Gatto, Shinji Sako

    The 86th National Convention of IPSJ  2024.03  Information Processing Society of Japan

    Event date: 2024.03

    Language:Japanese   Presentation type:Oral presentation (general)  

    Venue:Kanagawa University, Yokohama Campus  

    The purpose of this research is to assist beginners in composing lighting effects for music concerts by determining lighting color, brightness, and movement for each segment of music based on the repetitive structure of the music sound. By determining the lighting effects according to the actual lighting composition procedure, we aim to compose a lighting performance that matches the atmosphere of the music piece.

  • Estimation of respiration related to the temporal structure of sign language based on 3D data

    Kentaro Kasama, Shinji Sako

    The 86th National Convention of IPSJ  2024.03  Information Processing Society of Japan

    Event date: 2024.03

    Language:Japanese   Presentation type:Oral presentation (general)  

    Venue:Kanagawa University, Yokohama Campus  

    This study aims to obtain respiration information from a sign language dataset in order to improve the naturalness of sign language production by taking into account respiration during signing, which is considered to be related to the temporal structure of sign language. We estimate respiration using 3D data from existing sign language datasets and verify that the estimated respiration indicates sign language-specific respiration.

  • Recognition of mouth actions in Japanese sign language video using lip reading technique

    Yuika Umeda, Shinji Sako

    The 86th National Convention of IPSJ  2024.03  Information Processing Society of Japan

    Event date: 2024.03

    Language:Japanese   Presentation type:Oral presentation (general)  

    Venue:Kanagawa University, Yokohama Campus  

    This research aims at automatic annotation of Japanese Sign Language in order to solve the data shortage of Japanese Sign Language. Focusing on the annotation of mouth shapes in signs, the detection and recognition of mouth shapes in Japanese Sign Language are verified using a machine lip-reading model. For the validation, we will use a dataset created from video footage of Japanese Sign Language to confirm and evaluate the accuracy of detection and recognition of mouth shapes in Japanese Sign Language.

  • Music Visualization using Chironomie International coauthorship International conference

    Kana Tatsumi, Shinji Sako, Rafael Ramirez

    24th International Society for Music Information Retrieval Conference  2023.11  International Society for Music Information Retrieval

    Event date: 2023.11

    Language:English   Presentation type:Poster presentation  

    Venue:Milan   Country:Italy  

    The purpose of this study is to facilitate the enjoyment of music for both hearing-impaired and normal-hearing individuals by visually representing music. In order to effectively and distinctively portray the musical rhythm, we focus on Chironomie, a conducting technique used in Gregorian chant. Generally, Chironomie is drawn as a curve that corresponds to the musical score, and this curve is determined by whether a short segment of the score represents one of two classes: Arsis or Thesis. In pursuit of our objective, our endeavors encompass two essential facets: the adaptation of Chironomie for Western tonal music to express intuitively perceivable musical features like tension and relaxation, and the evaluation of whether Chironomie can effectively convey music visually. We report an automated method for estimating Arsis and Thesis within composite beats to generate Chironomie. Additionally, we present evaluation experiments involving normal-hearing participants to assess the effectiveness of Chironomie.

Industrial Property Rights

  • Word determination system

    青井基行,赤津 舞子,三浦 七瀬,酒向 慎司

    Applicant:Union Software Management Co., Ltd., National University Corporation Nagoya Institute of Technology

    Application no:特願2018-048022  Date applied:2018.03

    Country of applicant:Domestic   Country of acquisition:Domestic

  • Drinking state determination device and drinking state determination method

    岩田 英三郎, 酒向 慎司

    Application no:PCT/JP2010/062776  Date applied:2010.07

    Announcement no:特開2011-553634  Date announced:2012.06

    Country of applicant:Domestic   Country of acquisition:Domestic



  • Speech synthesis method and apparatus

    嵯峨山 茂樹, 槐 武也, 酒向 慎司, 松本 恭輔, 西本 卓也

    Application no:特願2005-304082  Date applied:2005.10

    Announcement no:特開2007-114355  Date announced:2007.05

    Country of applicant:Domestic   Country of acquisition:Domestic



  • Speech recognition device and computer program

    山口 辰彦, 酒向 慎司, 山本 博史, 菊井 玄一郎

    Application no:特願2003-317559  Date applied:2003.09

    Announcement no:特開2005-84436  Date announced:2005.03

    Country of applicant:Domestic   Country of acquisition:Domestic




  • Accessibility Guidelines for Writing and Presenting Papers (Ver. 4.0)

    井上 正之, 苅田 知則, 今野 順, 坂本 隆, 酒向 慎司, 塩野目 剛亮, 布川 清彦, 南谷 和範, 宮城 愛美, 若月 大輔


    Work type:Educational material  




  • Kogakuin University Japanese Sign Language Multi-Dimensional Database (2nd Term)

    Yuji Nagashima, Daisuke Hara, Yasuo Horiuchi, Shinji Sako


    Work type:Database science  

    The purpose of this dataset is to create a general-purpose sign language database that can be used in a variety of research fields. The dataset contains as much high-definition and high-accuracy data as possible for more than 6,000 sign words and a few dialogues selected by the project. The subjects were two native Japanese signers (one male and one female) from native sign language families, and the filming was conducted at the motion capture studio of Toei Tokyo Studio from 2017 to 2019. In addition to sign language video data (original MXF and mp4 formats) from 4K or full HD cameras installed in front and left/right, 3D motion data (BVH, C3D, and FBX formats) from optical motion capture and depth data from Kinect sensors (Kinect v2 xef format) are also included. As the second phase, 1,172 words and 7 dialogues are provided.

  • National Museum of Ethnology, Homō loquēns 'talking human' Wonders of Language and Languages

    Yuji Nagashima, Daisuke Hara, Yasuo Horiuchi, Shinji Sako

    2022.09 - 2022.11

    Work type:Database science   Location:National Museum of Ethnology  

    A technical exhibit introducing the high-precision sign language database KoSign was presented at the National Museum of Ethnology's special exhibition Homō loquēns "Talking Humans" Wonders of Language and Languages. The hand movements and facial expressions of sign language can be recorded precisely as digital data by using a motion capture system. The large amount of recorded data, covering thousands of Japanese Sign Language signs used in daily life, enables the analysis of sign language and its expression by avatars.

  • Kogakuin University Japanese Sign Language Multi-Dimensional Database

    Yuji Nagashima, Daisuke Hara, Yasuo Horiuchi, Shinji Sako


    Work type:Database science  

    The purpose of this project is to create a general-purpose sign language database that can be used in a variety of research fields. The dataset contains as much high-definition and high-accuracy data as possible for more than 6,000 sign words and a few dialogues selected by the project. The subjects were two native Japanese signers (one male and one female) from native sign language families, and filming was conducted at the motion capture studio of Toei Tokyo Studio from 2017 to 2019. In addition to sign language video data (original MXF and mp4 formats) from 4K or full HD cameras installed in front and to the left and right, the dataset includes 3D motion data (BVH, C3D, and FBX formats) from optical motion capture and depth data from Kinect sensors (Kinect v2 xef format). The first phase of the project provides 3,701 words, 3 dialogues, and a dedicated analysis tool (Drawing and Annotation Support System). The total data size is approximately 3.6 TB.


    Teppei Miura, Shinji Sako


    Work type:Database science  

    The dataset comprises 7 subjects, each performing 16 sentences 3 to 4 times.
    The archived dataset size is 1.52 GB.

    The dataset tree is organized as follows:
    + A (personal ID for paper)
    | + 011001001 (personal ID, sentence, and repetition, 3 digits each)
    | | + input
    | | | + 0000000001.jpg (RGB image)
    | | | + 0000000002.jpg
    | | | + ...
    | | |
    | | + target
    | |   + 0000000001.txt (3D joint positions)
    | |   + 0000000002.txt
    | |   + ...
    | |
    | + 011001002 ...
    + B ...

    Each target text file holds 3D joint position data in the following order:
    Time Stamp
    Left Shoulder
    Right Shoulder
    Left Elbow
    Right Elbow
    Left Wrist
    Right Wrist
    Left Hand
    Right Hand
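
    As a minimal sketch of how the naming scheme above could be decoded, the following Python splits a 9-digit sample directory name into its three 3-digit parts and records the field order of the target files. The names `SampleId`, `parse_sample_id`, and `TARGET_FIELDS` are hypothetical and not part of the dataset; the internal layout of each target text file (delimiters, number of values per joint) is not described here, so it is deliberately not parsed.

```python
from typing import NamedTuple

# Field order of each target text file, as listed in the description above.
TARGET_FIELDS = [
    "Time Stamp",
    "Left Shoulder", "Right Shoulder",
    "Left Elbow", "Right Elbow",
    "Left Wrist", "Right Wrist",
    "Left Hand", "Right Hand",
]


class SampleId(NamedTuple):
    """Hypothetical container for the three 3-digit parts of a sample name."""
    subject: int   # personal ID
    sentence: int  # sentence number
    take: int      # repetition ("times")


def parse_sample_id(name: str) -> SampleId:
    """Split a directory name such as '011001001' into subject, sentence, take."""
    if len(name) != 9 or not (name.isascii() and name.isdigit()):
        raise ValueError(f"expected a 9-digit sample name, got {name!r}")
    return SampleId(int(name[0:3]), int(name[3:6]), int(name[6:9]))


if __name__ == "__main__":
    print(parse_sample_id("011001001"))
```

    Under these assumptions, the directory `011001001` would be read as personal ID 11, sentence 1, repetition 1.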

  • Pressivo: An Automatic Accompaniment Generation System That Takes the Performance Expression of the Melody into Account

    Work type:Software   Location:Interaction 2014  

  • A stochastic model of artistic deviation and its musical score for the elucidation of performance expression

    Work type:Software   Location:Stockholm, Sweden

  • Ryry: Automatic Accompaniment System Capable of Polyphonic Instruments

    Work type:Software  

  • Music Impression Database



    Work type:Software  

  • Automatic Composition System Orpheus



    Work type:Software  

Other research activities

  • Multimodal Speech Database for Research M2TINIT


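    M2TINIT (Multi-Modal Speech Database by Tokyo Institute of Technology and Nagoya Institute of Technology) is a multimodal speech database for research, containing simultaneously recorded speech and lip video. It was developed and released by the Takao Kobayashi Laboratory, Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, and the Kitamura-Tokuda Laboratory, Department of Intelligent Information Systems, Nagoya Institute of Technology, to promote multimodal speech research. It has been used so far in research on speech and lip-video generation and on bimodal speech recognition.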


  • Best Presentation Award: Best New Direction Category

    2024.08   Information Processing Society of Japan   Dynamics Restoration for "Loud" Popular Music

    Keita Kawachi, Shinji Sako

    Award type:Award from Japanese society, conference, symposium, etc.  Country:Japan

    In recent years, the proliferation of music distribution services has markedly improved the accessibility of music. However, music listening is often a passive activity in which music is consumed as background sound, and opportunities for deep appreciation of music are perceived to be declining. This study therefore posits that verbalizing and explaining music is an effective way to help listeners understand music and to enhance the value of their musical experience. The verbalization of music has recently been studied extensively as music captioning, a task that describes information about music in natural language sentences. Prior research has concentrated on generating musical descriptions from acoustic data alone. In this study, we sought to generate review sentences for songs by considering lyrics in addition to acoustic information. Specifically, MU-LLaMA, which generates music descriptions using a musical feature extractor and a large language model (LLM), was used as the baseline model. Review text generation that also takes lyrics into account was achieved by designing system prompts that give instructions to LLaMA in advance. Three evaluation experiments confirmed that the proposed method is more effective than conventional methods in terms of diversity of expression and the formation of a mental image of the music.

  • WIT Student Research Award

    2023.12   The Institute of Electronics, Information and Communication Engineers, Well-being Information Technology   Proposal for Music Visualization Method Using Chironomie for Enhancing Musical Experience of the Hearing Impaired

    Kana Tatsumi, Shinji Sako

    Award type:Award from Japanese society, conference, symposium, etc.  Country:Japan

    The aim of this study is to enable both hearing-impaired and normal-hearing people to enjoy music together by visualizing it using Chironomie, the conducting method for Gregorian chant. To achieve this goal, the tasks include adapting Chironomie to Western music so that it expresses intuitively perceivable musical features such as tension and relaxation, and evaluating whether Chironomie can effectively convey music visually. This report focuses on a method for automatically estimating Arsis and Thesis for complex beats to generate Chironomie, and presents the results of an evaluation experiment with normal-hearing participants that assessed the utility of Chironomie.

  • Best Presentation Award, The Tokai Chapter of Acoustical Society of Japan

    2023.12   The Tokai Chapter of Acoustical Society of Japan   A study on music visualization based on Chironomie

    Kana Tatsumi

    Award type:Award from Japanese society, conference, symposium, etc.  Country:Japan

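  • 3rd Place Overall, 27th Tokai Area Speech-Related Laboratories Master's Thesis Interim Presentation Meeting

    2023.08   Shizuoka University   Music Visualization Using Melodic Lines Based on Chironomie

    Kana Tatsumi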

    Award type:Award from Japanese society, conference, symposium, etc.  Country:Japan

  • Best Presentation Award, The Tokai Chapter of Acoustical Society of Japan

    2021.12   The Tokai Chapter of Acoustical Society of Japan   Dynamics Restoration for "Loud" Popular Music

    Hyuga Ozeki

    Award type:Award from Japanese society, conference, symposium, etc.  Country:Japan

  • Student Encouraging Award

    2021.09   Information Processing Society of Japan   Dynamics Restoration for "Loud" Popular Music

    Hyuga Ozeki, Shinji Sako

    Award type:Award from Japanese society, conference, symposium, etc.  Country:Japan

  • Japan Society for Fuzzy Theory and Intelligent Informatics Best Paper Award

    2017.09   Japan Society for Fuzzy Theory and Intelligent Informatics   Automatic Performance Rendering Method for Keyboard Instruments based on Statistical Model that Associates Performance Expression and Musical Notation

    Kenta Okumura, Shinji Sako, Tadashi Kitamura

    Award type:Honored in official journal of a scientific society, scientific journal  Country:Japan

    This paper proposes a method for automatically rendering performances without losing any characteristics of the specific performer. Many existing methods require users to supply expertise of the kind possessed by the performer. Although such methods are useful for supporting users' own performances, they are not suitable for the purpose of this proposal. The proposed method defines a model that associates the expression features extracted from cases of actual performance with score directions that can be reliably retrieved from the musical score without expertise. By classifying the expressive tendencies of the model for each performance case using criteria based on score directions, rules that systematically elucidate the causal relationship between the performer's specific performance expression and the score directions can be structured. Candidate performance cases corresponding to unseen score directions are obtained by tracing this structure. Dynamic programming is applied to search for the sequence of performance cases with the optimal expression among these candidates. Objective evaluations indicated that the proposed method is able to render optimal performances efficiently. Subjective evaluations confirmed the quality of the expression rendered by the proposed method. It was also shown that the characteristics of the performer could be reproduced across various compositions. Furthermore, performances rendered by the proposed method won first prize in the autonomous section of a performance rendering contest for computer systems.

  • IPSJ Yamashita SIG Research Award

    2016.03   Information Processing Society of Japan   A study of comparative analysis of music performances based on the statistical model that associates expression and notation

    Kenta Okumura, Shinji Sako, Tadashi Kitamura

    Award type:Award from Japanese society, conference, symposium, etc.  Country:Japan

  • 78th National Convention of IPSJ, Student Encouragement Award

    2016.03   Information Processing Society of Japan   A case-based approach to the melody transformation for automatic jazz arrangement

    Naoto Sato, Shinji Sako, Tadashi Kitamura

    Award type:Award from Japanese society, conference, symposium, etc.  Country:Japan

  • Award for Contribution to Society Activities

    2014.03   The Tokai Chapter of Acoustical Society of Japan  

    Shinji Sako

Scientific Research Funds Acquisition Results

  • Building a Technological Foundation for Supporting and Recording Sign Language Dialogue Using First-Person-View Video

    Grant number:23K11197  2023.04 - 2026.03

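    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Scientific Research (C)

    Shinji Sako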

    Authorship:Principal investigator  Grant type:Competitive


  • Development of a Sign Language Recognition Engine Using Self-Supervised Learning Methods

    Grant number:23747929  2023.04 - 2025.03

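    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Challenging Exploratory Research

    Tsutomu Kimura (Principal investigator)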

    Authorship:Coinvestigator(s)  Grant type:Competitive

  • Development of a Semi-Automatic Generation System for Sign Language Corpora and Labeled Sign Language Data for Deep Learning

    Grant number:22H00661  2022.04 - 2026.03

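    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Scientific Research (B)

    Tsutomu Kimura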

    Authorship:Coinvestigator(s)  Grant type:Competitive


  • Research on Automatic Composition, Automatic Lyric Writing, and Automatic Accompaniment Applying Speech Recognition Methods

    Grant number:21H03462  2021.04 - 2024.03

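    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Scientific Research (B)

    Shigeki Sagayama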

    Authorship:Coinvestigator(s)  Grant type:Competitive


  • Basic Research on the Creation and Use of Sound Information Produced by Visually Impaired People Actively Tapping with a White Cane

    Grant number:18K18698  2018.04 - 2022.03

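    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Challenging Exploratory Research

    Kiyohiko Nunokawa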

    Authorship:Coinvestigator(s)  Grant type:Competitive

Past Commissioned Research

  • Research and Development on Building an AI-Based Automatic Inspection System for the Textile Industry

    2022.10 - 2025.03

    Aichi Prefecture  Knowledge Hub Aichi Priority Research Project, Project DX  General Consignment Study 

    Authorship:Coinvestigator(s)  Grant type:Collaborative (industry/university)


  • Development of High-Precision Motion Detection and Motion Pattern Matching Technology for Automatic Sign Language Translation

    2016.10 - 2019.03

    Ministry of Economy, Trade and Industry  Strategic Core Technology Advancement Program (Supporting Industry Program)  General Consignment Study 

    青井 基行

    Authorship:Coinvestigator(s)  Grant type:Competitive

  • Research on an Automatic Performance System That Comfortably Adapts to Human Players

    2015.01 - 2015.12

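    Japan Science and Technology Agency   Adaptable and Seamless Technology Transfer Program through Target-driven R&D (A-STEP), FS Stage  General Consignment Study 

    Shinji Sako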

    Authorship:Principal investigator  Grant type:Competitive

    Grant amount: ¥2,210,000 (Direct Cost: ¥1,700,000, Indirect Cost: ¥510,000)


  • Development of an Automatic Accompaniment System for Rehabilitation Support That Flexibly Accommodates Diverse Usage Patterns

    2013.08 - 2014.03

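    Japan Science and Technology Agency   Adaptable and Seamless Technology Transfer Program through Target-driven R&D (A-STEP), FS Stage  General Consignment Study 

    Shinji Sako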

    Authorship:Principal investigator  Grant type:Competitive

    Grant amount: ¥2,210,000 (Direct Cost: ¥1,700,000, Indirect Cost: ¥510,000)


  • Research on a Statistical-Model-Based Kansei Estimation Model for Music That Can Adapt to Changes in User Preferences and Usage Scenes

    2011.08 - 2012.03

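    Japan Science and Technology Agency   Adaptable and Seamless Technology Transfer Program through Target-driven R&D (A-STEP), FS Stage  General Consignment Study 

    Shinji Sako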

    Authorship:Principal investigator  Grant type:Competitive

    Grant amount: ¥2,210,000 (Direct Cost: ¥1,700,000, Indirect Cost: ¥510,000)



Teaching Experience

  • Graduate School of Engineering, Master's Course: Advanced Mathematical Informatics

    2023.04 Institution:Nagoya Institute of Technology

  • Evening Division, Department of Electronics and Information Engineering: Fundamentals of Computers

    2022.04 - 2024.03 Institution:Nagoya Institute of Technology

    Level:Undergraduate (specialized) 

  • Department of Informatics, Graduate School Common Course: Introduction to Information Resources

    2021.06 Institution:Shizuoka University

    Level:Graduate (liberal arts) 

  • Graduate School of Engineering, Master's Course: Advanced Mathematical Informatics


  • Evening Division, Department of Electronics and Information Engineering: Computer Engineering

    2018.04 - 2021.03 Institution:Nagoya Institute of Technology

    Level:Undergraduate (specialized) 

Committee Memberships

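  • Information Processing Society of Japan   FIT2024 (24th Forum on Information Technology), SIG Liaison Committee Member and Program Committee Member  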

    2023.11 - 2024.09   

    Committee type:Academic society

  • Acoustical Society of Japan   150th Autumn Meeting, Executive Committee Member  

    2023.05 - 2023.09   

    Committee type:Academic society

  • The Institute of Electronics, Information and Communication Engineers   IEICE Well-being Information Technology (WIT), Vice Chairman  


    Committee type:Academic society

  • The Institute of Electronics, Information and Communication Engineers   Human Communication Symposium 2023, Steering Committee Member  


    Committee type:Academic society

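  • The Institute of Electronics, Information and Communication Engineers   FIT2023 (22nd Forum on Information Technology), SIG Liaison Committee Member and Program Committee Member  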

    2023.01 - 2023.09   

    Committee type:Academic society

  • General Incorporated Association Council for a Multicultural Society Including Sign Languages   Delegate  


    Committee type:Other

  • The Institute of Electronics, Information and Communication Engineers   Human Communication Symposium 2022, Steering Committee Member  

    2022.10 - 2022.12   

    Committee type:Academic society

  • Information Processing Society of Japan   IPSJ Special Interest Group on Music and Computer  


    Committee type:Academic society

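  • The Institute of Electronics, Information and Communication Engineers   FIT2022 (21st Forum on Information Technology), SIG Liaison Committee Member and Program Committee Member  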

    2022.01 - 2022.09   

    Committee type:Academic society

  • The Institute of Electronics, Information and Communication Engineers   Language as Real-time Communication, Chairman  


    Committee type:Academic society

Social Activities

  • Development of Anomaly Detection and Prediction Technology for Operating Sounds at Production Sites

    Role(s): Lecturer

    Owari Textile Research Center  Online (Zoom)  2022.10

    Type:Visiting lecture