Installing Julius on a Server and Running a Test

Installing Julius

Eventually I intend to try running it in module mode.

$ git clone https://github.com/julius-speech/julius

The files are copied from GitHub to the local machine.

$ cd julius
$ ./configure

After the configure output scrolls past, this summary appears.

****************************************************************
Julius/Julian libsent library rev.4.4.2.1:

- Audio I/O
    primary mic device API   : oss (Open Sound System compatible)
    available mic device API : oss
    supported audio format   : RAW and WAV only
    NetAudio support         : no
- Language Modeling
    class N-gram support     : yes
- Libraries
    file decompression by    : zlib library
- Process management
    fork on adinnet input    : no

  Note: compilation time flags are now stored in "libsent-config".
        If you link this library, please add output of
        "libsent-config --cflags" to CFLAGS and
        "libsent-config --libs" to LIBS.
****************************************************************

Right, on to the next step.

$ make
$ sudo make install

It was all over in an instant.
No warnings showed up, so I suppose it got installed... maybe??
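One way to settle the doubt is to ask the shell whether a julius binary is now on the PATH. This is a minimal sketch: check_installed is a helper name made up here, and it assumes that `sudo make install` put the binary in a directory that is on the PATH (which depends on the configure prefix).

```shell
#!/bin/sh
# check_installed (hypothetical helper): prints where a command lives,
# or reports that it is not on PATH and returns a non-zero status.
check_installed() {
    if path=$(command -v "$1" 2>/dev/null); then
        echo "$1 found: $path"
    else
        echo "$1 not found on PATH"
        return 1
    fi
}

# Do not abort if julius is missing; just report it.
check_installed julius || true
```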

Installing the grammar recognition kit.

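Before that, though, I was told to install git lfs, a tool for downloading large files from git...
I looked at various sites, but could not find any report of a successful package install on Ubuntu 17.
So I just went ahead and ran the command.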

$ sudo apt install git-lfs
[sudo] password for cooloctober:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.

...... (rest omitted) ......

It seems to have installed without any fuss.
Much appreciated.

Now, back to installing the grammar recognition kit.

$ git lfs clone https://github.com/julius-speech/grammar-kit

Installed successfully.
The files do not feel all that big, so I am not sure whether git lfs was really needed.
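Whether git lfs was really needed can be checked from the clone itself, since paths stored through LFS are marked with filter=lfs in .gitattributes. The following is only a rough sketch: uses_lfs is a helper name made up for illustration, and it looks at the top-level .gitattributes only.

```shell
#!/bin/sh
# uses_lfs (hypothetical helper): succeeds when the top-level .gitattributes
# of the repository given as $1 routes at least one path through the LFS filter.
uses_lfs() {
    [ -f "$1/.gitattributes" ] && grep -q 'filter=lfs' "$1/.gitattributes"
}

# Run from the directory that contains the clone.
if uses_lfs grammar-kit; then
    echo "grammar-kit tracks files with git lfs"
else
    echo "no LFS entries found in grammar-kit/.gitattributes"
fi
```

Inside the repository, `git lfs ls-files` lists the individual files that are stored through LFS.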

Let's test it.
Julius is on a server, so a microphone cannot be used directly in this environment.
The test is run with a test audio file instead.

$ cd grammar-kit
$ julius -C testmic.jconf -nostrip -charconv SJIS UTF-8 -input rawfile

STAT: include config: testmic.jconf
STAT: include config: hmm_ptm.jconf
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Stat: read_binhmm: binary format HMM definition
Stat: read_binhmm: this HMM does not need multipath handling
Stat: init_phmm: defined HMMs:  7946
Stat: init_phmm: loading ascii hmmlist
Stat: init_phmm: logical names: 21424 in HMMList
Stat: init_phmm: base phones:    43 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: making pseudo bi/mono-phone for IW-triphone
Stat: hmm_lookup: 10 pseudo phones are added to logical HMM list
STAT: *** AM00 _default loaded
STAT: *** loading LM00 _default
STAT: reading [SampleGrammars/fruit/fruit.dfa] and [SampleGrammars/fruit/fruit.dict]...
Stat: init_voca: read 20 words
STAT: done
STAT: Gram #0 fruit registered
STAT: Gram #0 fruit: new grammar loaded, now mash it up for recognition
STAT: Gram #0 fruit: extracting category-pair constraint for the 1st pass
STAT: Gram #0 fruit: installed
STAT: Gram #0 fruit: turn on active
STAT: grammar update completed
STAT: *** LM00 _default loaded
STAT: ------
STAT: All models are ready, go for final fusion
STAT: [1] create MFCC extraction instance(s)
STAT: *** create MFCC calculation modules from AM
STAT: AM 0 _default: create a new module MFCC01
STAT: 1 MFCC modules created
STAT: [2] create recognition processing instance(s) with AM and LM
STAT: composing recognizer instance SR00 _default (AM00 _default, LM00 _default)
STAT: Building HMM lexicon tree
STAT: lexicon size: 201+0=201
STAT: coordination check passed
STAT: multi-gram: beam width set to 200 (guess) by lexicon change
STAT: wchmm (re)build completed
STAT: SR00 _default composed
STAT: [3] initialize for acoustic HMM calculation
Stat: outprob_init: all mixture PDFs are tied-mixture, use calc_tied_mix()
Stat: addlog: generating addlog table (size = 1953 kB)
Stat: addlog: addlog table generated
STAT: [4] prepare MFCC storage(s)
STAT: All init successfully done

STAT: ###### initialize input device
----------------------- System Information begin ---------------------
JuliusLib rev.4.4.2.1 (fast)

Engine specification:
 -  Base setup   : fast
 -  Supported LM : DFA, N-gram, Word
 -  Extension    :
 -  Compiled by  : gcc -O6 -fomit-frame-pointer
Library configuration: version 4.4.2.1
 - Audio input
    primary A/D-in driver   : oss (Open Sound System compatible)
    available drivers       : oss
    wavefile formats        : RAW and WAV only
    max. length of an input : 320000 samples, 150 words
 - Language Model
    class N-gram support    : yes
    MBR weight support      : yes
    word id unit            : short (2 bytes)
 - Acoustic Model
    multi-path treatment    : autodetect
 - External library
    file decompression by   : zlib library
 - Process hangling
    fork on adinnet input   : no
 - built-in SIMD instruction set for DNN
    SSE AVX FMA
    AVX is available maximum on this cpu, use it


------------------------------------------------------------
Configuration of Modules

 Number of defined modules: AM=1, LM=1, SR=1

 Acoustic Model (with input parameter spec.):
 - AM00 "_default"
    hmmfilename=model/phone_m/hmmdefs_ptm_gid.binhmm
    hmmmapfilename=model/phone_m/logicalTri

 Language Model:
 - LM00 "_default"
    grammar #1:
        dfa  = SampleGrammars/fruit/fruit.dfa
        dict = SampleGrammars/fruit/fruit.dict

 Recognizer:
 - SR00 "_default" (AM00, LM00)

------------------------------------------------------------
Speech Analysis Module(s)

[MFCC01]  for [AM00 _default]

 Acoustic analysis condition:
           parameter = MFCC_E_D_N_Z (25 dim. from 12 cepstrum + energy, abs energy supressed with CMN)
    sample frequency = 16000 Hz
       sample period =  625  (1 = 100ns)
         window size =  400 samples (25.0 ms)
         frame shift =  160 samples (10.0 ms)
        pre-emphasis = 0.97
        # filterbank = 24
       cepst. lifter = 22
          raw energy = False
    energy normalize = False
        delta window = 2 frames (20.0 ms) around
         hi freq cut = OFF
         lo freq cut = OFF
     zero mean frame = OFF
           use power = OFF
                 CVN = OFF
                VTLN = OFF

    spectral subtraction = off

 cep. mean normalization = yes, with per-utterance self mean
 cep. var. normalization = no

     base setup from = Julius defaults

------------------------------------------------------------
Acoustic Model(s)

[AM00 "_default"]

 HMM Info:
    7946 models, 3131 states, 3131 mpdfs, 8256 Gaussians are defined
          model type = has tied-mixture, context dependency handling ON
      training parameter = MFCC_E_N_D_Z
       vector length = 25
    number of stream = 1
         stream info = [0-24]
    cov. matrix type = DIAGC
       duration type = NULLD
        codebook num = 129
       max codebook size = 64
    max mixture size = 64 Gaussians
     max length of model = 5 states
     logical base phones = 43
       model skip trans. = not exist, no multi-path handling

 AM Parameters:
        Gaussian pruning = beam  (-gprune)
  top N mixtures to calc = 2 / 64  (-tmix)
    short pause HMM name = "sp" specified, "sp" applied (physical)  (-sp)
  cross-word CD on pass1 = handle by approx. (use average prob. of same LC)

------------------------------------------------------------
Language Model(s)

[LM00 "_default"] type=grammar

 DFA grammar info:
      8 nodes, 10 arcs, 9 terminal(category) symbols
      category-pair matrix: 60 bytes (1024 bytes allocated)

 Vocabulary Info:
        vocabulary size  = 20 words, 67 models
        average word len = 3.3 models, 10.1 states
       maximum state num = 21 nodes per word
       transparent words = not exist
       words under class = not exist

 Parameters:
   found sp category IDs =

------------------------------------------------------------
Recognizer(s)

[SR00 "_default"]  AM00 "_default"  +  LM00 "_default"

 Lexicon tree:
     total node num =    201
      root node num =     20
      leaf node num =     20

    (-penalty1) IW penalty1 = +0.0
    (-penalty2) IW penalty2 = +0.0
    (-cmalpha)CM alpha coef = 0.050000

 Search parameters:
        multi-path handling = no
    (-b) trellis beam width = 200 (-1 or not specified - guessed)
    (-bs)score pruning thres= disabled
    (-n)search candidate num= 1
    (-s)  search stack size = 500
    (-m)    search overflow = after 2000 hypothesis poped
            2nd pass method = searching sentence, generating N-best
    (-b2)  pass2 beam width = 30
    (-lookuprange)lookup range= 5  (tm-5 <= t <tm+5)
    (-sb)2nd scan beamthres = 80.0 (in logscore)
    (-n)        search till = 1 candidates found
    (-output)    and output = 1 candidates out of above
     IWCD handling:
       1st pass: approximation (use average prob. of same LC)
       2nd pass: loose (apply when hypo. is popped and scanned)
     all possible words will be expanded in 2nd pass
     build_wchmm2() used
     lcdset limited by word-pair constraint
    progressive output on 1st pass
    short pause segmentation = off
            progout interval = 300 msec
    fall back on search fail = off, returns search failure

------------------------------------------------------------
Decoding algorithm:

    1st pass input processing = buffered, batch
    1st pass method = 1-best approx. generating indexed trellis
    output word confidence measure based on search-time scores

------------------------------------------------------------
FrontEnd:

 Input stream:
                 input type = waveform
               input source = waveform file
              input filelist = (none, get file name from stdin)
              sampling freq. = 16000 Hz required
             threaded A/D-in = supported, off
       zero frames stripping = off
             silence cutting = off
        long-term DC removal = off
        level scaling factor = 1.00 (disabled)
          reject short input = off
          reject  long input = off

----------------------- System Information end -----------------------

Notice for feature extraction (01),
    *************************************************************
    * Cepstral mean normalization for batch decoding:           *
    * per-utterance mean will be computed and applied.          *
    *************************************************************

enter filename->sample.wav
Stat: adin_file: input speechfile: sample.wav
STAT: 33984 samples (2.12 sec.)
STAT: ### speech analysis (waveform -> MFCC)
pass1_best: リンゴ 3 個 を ください           
sentence1: リンゴ 3 個 を ください


It looks like the speech is being recognized.
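Because the result is buried under a long startup log, it can help to filter out just the final hypothesis lines. The sketch below assumes the output format seen above, where each final result starts with "sentence1:"; recognized_sentences is a hypothetical helper name, and the sample input is copied from the log in this post.

```shell
#!/bin/sh
# recognized_sentences (hypothetical helper): reads Julius output on stdin and
# prints only the text of the "sentenceN:" lines, without trailing whitespace.
recognized_sentences() {
    sed -n '/^sentence[0-9][0-9]*:/{
        s/^sentence[0-9][0-9]*: *//
        s/[[:space:]]*$//
        p
    }'
}

# Demo on two lines taken from the log above.
recognized_sentences <<'EOF'
pass1_best: リンゴ 3 個 を ください
sentence1: リンゴ 3 個 を ください
EOF
```

Since the log reports "get file name from stdin", something like `echo sample.wav | julius -C testmic.jconf -nostrip -charconv SJIS UTF-8 -input rawfile | recognized_sentences` should presumably work for an unattended run, though I have not verified that here.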
