Installing Julius
Eventually I plan to try running it in module mode.
$ git clone https://github.com/julius-speech/julius
The files are copied from GitHub to the local machine.
$ cd julius
$ ./configure
After the configure output scrolls by, this summary is displayed.
****************************************************************
Julius/Julian libsent library rev.4.4.2.1:
- Audio I/O
primary mic device API : oss (Open Sound System compatible)
available mic device API : oss
supported audio format : RAW and WAV only
NetAudio support : no
- Language Modeling
class N-gram support : yes
- Libraries
file decompression by : zlib library
- Process management
fork on adinnet input : no
Note: compilation time flags are now stored in "libsent-config".
If you link this library, please add output of
"libsent-config --cflags" to CFLAGS and
"libsent-config --libs" to LIBS.
****************************************************************
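The note at the end of that summary could be followed roughly as below once the library is installed. This is only a sketch: collect_flags is a made-up helper name, and nothing in this post actually links against libsent, so treat it as an illustration of the note rather than a tested build recipe.

```shell
# Hypothetical helper: append the flags reported by a *-config tool
# (for example libsent-config, as named in the configure note)
# to CFLAGS and LIBS.
collect_flags() {
    tool="$1"
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool not found on PATH" >&2
        return 1
    fi
    CFLAGS="${CFLAGS:+$CFLAGS }$("$tool" --cflags)"
    LIBS="${LIBS:+$LIBS }$("$tool" --libs)"
}

# Only meaningful after "sudo make install" has put the tool on PATH.
if collect_flags libsent-config; then
    echo "CFLAGS=$CFLAGS"
    echo "LIBS=$LIBS"
fi
```

Before installation the helper just reports that the tool is missing and leaves both variables untouched.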
Well then, shall we move on?
$ make
$ sudo make install
Somehow the whole thing finished in an instant.
No warnings came up, so it got installed... I guess??
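One quick way to settle that kind of doubt is to ask the shell where the command resolves. A minimal sketch; check_installed is a made-up helper name, not something Julius ships.

```shell
# Report where a command resolves on PATH, or say that it is missing.
check_installed() {
    if path=$(command -v "$1" 2>/dev/null); then
        echo "$1: found at $path"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

check_installed julius || echo "julius does not appear to be installed"
```

If the first line of output shows a path, the install step at least put the binary somewhere the shell can find it.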
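Next, install the grammar recognition kit.
Before that, though, I'm ordered to install git lfs, a tool for downloading large files from git...
I looked through various sites, but found no reports of a successful package install on Ubuntu 17.
So, for the time being, I just tried the command.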
$ sudo apt install git-lfs
[sudo] password for cooloctober:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
git-lfs
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
...... (rest omitted) ......
It seems to have installed without any fuss.
Much appreciated.
So, once again, let's try bringing in the grammar recognition kit.
$ git lfs clone https://github.com/julius-speech/grammar-kit
Success.
The files don't feel all that big, so I'm not really sure whether git lfs was needed.
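Whether a repository actually relies on LFS can be checked after the fact. A rough sketch, assuming the LFS rules live in the top-level .gitattributes (nested .gitattributes files are not inspected); uses_lfs is a made-up helper name.

```shell
# Return success if the repository's top-level .gitattributes
# routes any path through the LFS filter.
uses_lfs() {
    [ -f "$1/.gitattributes" ] && grep -q 'filter=lfs' "$1/.gitattributes"
}

if uses_lfs grammar-kit; then
    echo "grammar-kit declares LFS-tracked files"
else
    echo "no LFS filter found in grammar-kit/.gitattributes"
fi
```

Running git lfs ls-files inside the clone should also list the files that LFS manages, if any.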
Let's test it.
Julius is sitting on a server, so I can't use a microphone directly in this environment.
I'll run the test with a test audio file instead.
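With file input, the audio has to match what the engine expects; the System Information log further down reports "sampling freq. = 16000 Hz required". A small sketch for peeking at a WAV file's sample rate before feeding it in. It assumes the canonical WAV layout where the "fmt " chunk directly follows the RIFF header, and wav_sample_rate is a made-up helper name.

```shell
# Print the sample rate stored in a WAV header.
# Assumption: canonical layout, so the 4-byte little-endian sample rate
# sits at byte offset 24. Files with extra chunks before "fmt " will
# give a wrong answer.
wav_sample_rate() {
    bytes=$(od -An -t u1 -j 24 -N 4 "$1") || return 1
    set -- $bytes
    [ $# -eq 4 ] || return 1
    echo $(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
}

f=sample.wav
if [ -f "$f" ]; then
    rate=$(wav_sample_rate "$f") && echo "$f: $rate Hz"
fi
```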
$ cd grammar-kit
$ julius -C testmic.jconf -nostrip -charconv SJIS UTF-8 -input rawfile
STAT: include config: testmic.jconf
STAT: include config: hmm_ptm.jconf
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Stat: read_binhmm: binary format HMM definition
Stat: read_binhmm: this HMM does not need multipath handling
Stat: init_phmm: defined HMMs: 7946
Stat: init_phmm: loading ascii hmmlist
Stat: init_phmm: logical names: 21424 in HMMList
Stat: init_phmm: base phones: 43 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: making pseudo bi/mono-phone for IW-triphone
Stat: hmm_lookup: 10 pseudo phones are added to logical HMM list
STAT: *** AM00 _default loaded
STAT: *** loading LM00 _default
STAT: reading [SampleGrammars/fruit/fruit.dfa] and [SampleGrammars/fruit/fruit.dict]...
Stat: init_voca: read 20 words
STAT: done
STAT: Gram #0 fruit registered
STAT: Gram #0 fruit: new grammar loaded, now mash it up for recognition
STAT: Gram #0 fruit: extracting category-pair constraint for the 1st pass
STAT: Gram #0 fruit: installed
STAT: Gram #0 fruit: turn on active
STAT: grammar update completed
STAT: *** LM00 _default loaded
STAT: ------
STAT: All models are ready, go for final fusion
STAT: [1] create MFCC extraction instance(s)
STAT: *** create MFCC calculation modules from AM
STAT: AM 0 _default: create a new module MFCC01
STAT: 1 MFCC modules created
STAT: [2] create recognition processing instance(s) with AM and LM
STAT: composing recognizer instance SR00 _default (AM00 _default, LM00 _default)
STAT: Building HMM lexicon tree
STAT: lexicon size: 201+0=201
STAT: coordination check passed
STAT: multi-gram: beam width set to 200 (guess) by lexicon change
STAT: wchmm (re)build completed
STAT: SR00 _default composed
STAT: [3] initialize for acoustic HMM calculation
Stat: outprob_init: all mixture PDFs are tied-mixture, use calc_tied_mix()
Stat: addlog: generating addlog table (size = 1953 kB)
Stat: addlog: addlog table generated
STAT: [4] prepare MFCC storage(s)
STAT: All init successfully done
STAT: ###### initialize input device
----------------------- System Information begin ---------------------
JuliusLib rev.4.4.2.1 (fast)
Engine specification:
- Base setup : fast
- Supported LM : DFA, N-gram, Word
- Extension :
- Compiled by : gcc -O6 -fomit-frame-pointer
Library configuration: version 4.4.2.1
- Audio input
primary A/D-in driver : oss (Open Sound System compatible)
available drivers : oss
wavefile formats : RAW and WAV only
max. length of an input : 320000 samples, 150 words
- Language Model
class N-gram support : yes
MBR weight support : yes
word id unit : short (2 bytes)
- Acoustic Model
multi-path treatment : autodetect
- External library
file decompression by : zlib library
- Process hangling
fork on adinnet input : no
- built-in SIMD instruction set for DNN
SSE AVX FMA
AVX is available maximum on this cpu, use it
------------------------------------------------------------
Configuration of Modules
Number of defined modules: AM=1, LM=1, SR=1
Acoustic Model (with input parameter spec.):
- AM00 "_default"
hmmfilename=model/phone_m/hmmdefs_ptm_gid.binhmm
hmmmapfilename=model/phone_m/logicalTri
Language Model:
- LM00 "_default"
grammar #1:
dfa = SampleGrammars/fruit/fruit.dfa
dict = SampleGrammars/fruit/fruit.dict
Recognizer:
- SR00 "_default" (AM00, LM00)
------------------------------------------------------------
Speech Analysis Module(s)
[MFCC01] for [AM00 _default]
Acoustic analysis condition:
parameter = MFCC_E_D_N_Z (25 dim. from 12 cepstrum + energy, abs energy supressed with CMN)
sample frequency = 16000 Hz
sample period = 625 (1 = 100ns)
window size = 400 samples (25.0 ms)
frame shift = 160 samples (10.0 ms)
pre-emphasis = 0.97
# filterbank = 24
cepst. lifter = 22
raw energy = False
energy normalize = False
delta window = 2 frames (20.0 ms) around
hi freq cut = OFF
lo freq cut = OFF
zero mean frame = OFF
use power = OFF
CVN = OFF
VTLN = OFF
spectral subtraction = off
cep. mean normalization = yes, with per-utterance self mean
cep. var. normalization = no
base setup from = Julius defaults
------------------------------------------------------------
Acoustic Model(s)
[AM00 "_default"]
HMM Info:
7946 models, 3131 states, 3131 mpdfs, 8256 Gaussians are defined
model type = has tied-mixture, context dependency handling ON
training parameter = MFCC_E_N_D_Z
vector length = 25
number of stream = 1
stream info = [0-24]
cov. matrix type = DIAGC
duration type = NULLD
codebook num = 129
max codebook size = 64
max mixture size = 64 Gaussians
max length of model = 5 states
logical base phones = 43
model skip trans. = not exist, no multi-path handling
AM Parameters:
Gaussian pruning = beam (-gprune)
top N mixtures to calc = 2 / 64 (-tmix)
short pause HMM name = "sp" specified, "sp" applied (physical) (-sp)
cross-word CD on pass1 = handle by approx. (use average prob. of same LC)
------------------------------------------------------------
Language Model(s)
[LM00 "_default"] type=grammar
DFA grammar info:
8 nodes, 10 arcs, 9 terminal(category) symbols
category-pair matrix: 60 bytes (1024 bytes allocated)
Vocabulary Info:
vocabulary size = 20 words, 67 models
average word len = 3.3 models, 10.1 states
maximum state num = 21 nodes per word
transparent words = not exist
words under class = not exist
Parameters:
found sp category IDs =
------------------------------------------------------------
Recognizer(s)
[SR00 "_default"] AM00 "_default" + LM00 "_default"
Lexicon tree:
total node num = 201
root node num = 20
leaf node num = 20
(-penalty1) IW penalty1 = +0.0
(-penalty2) IW penalty2 = +0.0
(-cmalpha)CM alpha coef = 0.050000
Search parameters:
multi-path handling = no
(-b) trellis beam width = 200 (-1 or not specified - guessed)
(-bs)score pruning thres= disabled
(-n)search candidate num= 1
(-s) search stack size = 500
(-m) search overflow = after 2000 hypothesis poped
2nd pass method = searching sentence, generating N-best
(-b2) pass2 beam width = 30
(-lookuprange)lookup range= 5 (tm-5 <= t
(-sb)2nd scan beamthres = 80.0 (in logscore)
(-n) search till = 1 candidates found
(-output) and output = 1 candidates out of above
IWCD handling:
1st pass: approximation (use average prob. of same LC)
2nd pass: loose (apply when hypo. is popped and scanned)
all possible words will be expanded in 2nd pass
build_wchmm2() used
lcdset limited by word-pair constraint
progressive output on 1st pass
short pause segmentation = off
progout interval = 300 msec
fall back on search fail = off, returns search failure
------------------------------------------------------------
Decoding algorithm:
1st pass input processing = buffered, batch
1st pass method = 1-best approx. generating indexed trellis
output word confidence measure based on search-time scores
------------------------------------------------------------
FrontEnd:
Input stream:
input type = waveform
input source = waveform file
input filelist = (none, get file name from stdin)
sampling freq. = 16000 Hz required
threaded A/D-in = supported, off
zero frames stripping = off
silence cutting = off
long-term DC removal = off
level scaling factor = 1.00 (disabled)
reject short input = off
reject long input = off
----------------------- System Information end -----------------------
Notice for feature extraction (01),
*************************************************************
* Cepstral mean normalization for batch decoding: *
* per-utterance mean will be computed and applied. *
*************************************************************
enter filename->sample.wav
Stat: adin_file: input speechfile: sample.wav
STAT: 33984 samples (2.12 sec.)
STAT: ### speech analysis (waveform -> MFCC)
pass1_best: リンゴ 3 個 を ください
sentence1: リンゴ 3 個 を ください
It looks like recognition is working. (The recognized sentence means "Please give me three apples.")
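In the run above the file name was typed at the "enter filename->" prompt, which matches the "input filelist = (none, get file name from stdin)" line in the log. On a server it may be handier to skip the prompt. A sketch using Julius's -filelist option with the same flags as above; the temporary list file is just for illustration, and the run is guarded so nothing happens where julius or the config is missing.

```shell
# Write one input path per line; sample.wav is the file used in the run above.
list=$(mktemp)
printf '%s\n' sample.wav > "$list"

# Only attempt the run where julius and the kit's config are actually present.
if command -v julius >/dev/null 2>&1 && [ -f testmic.jconf ]; then
    julius -C testmic.jconf -nostrip -charconv SJIS UTF-8 \
        -input rawfile -filelist "$list"
else
    echo "julius or testmic.jconf not available; skipping run"
fi
```

More paths can be appended to the list file, one per line, to process several recordings in one go.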