Abstract

Voice recognition has become an integral part of our lives, commonly used in call centers and in virtual assistants. Increasingly, however, it is being applied in industrial settings. Each of these use cases has unique characteristics that may affect the effectiveness of voice recognition and, in turn, industrial productivity, performance, or even safety. One of the most prominent characteristics is the background noise that dominates each industry, driven largely by the machinery in use and the layout of the workspace. Another is the type of communication: everyday speech often involves long sentences uttered in relatively quiet conditions, whereas communication in industrial settings is typically short and conducted in loud conditions. In this study, we demonstrate the importance of accounting for both of these elements by comparing the performance of two voice recognition algorithms under several background noise conditions: a conventional Convolutional Neural Network (CNN)-based voice recognition algorithm and an Automatic Speech Recognition (ASR)-based model with a denoising module. Our results indicate a significant performance drop between the typical background noise condition (white noise) and the remaining background noises. Moreover, our custom ASR model with the denoising module outperformed the CNN-based model, with an overall performance increase of 14–35% across all background noises. Both results show that specialized voice recognition algorithms need to be developed for these environments before they can be reliably deployed as control mechanisms.
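Evaluating a recognizer "under several background noise conditions" typically means mixing a recorded noise clip into each clean speech command at a controlled signal-to-noise ratio (SNR) before scoring. The sketch below is a minimal illustration of that mixing step, not the paper's actual pipeline; the function name and the use of raw NumPy arrays are assumptions for the example:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a background-noise clip into a clean speech clip at a target SNR (dB).

    Both inputs are 1-D float sample arrays at the same sampling rate
    (an assumption of this sketch).
    """
    # Loop the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

A lower `snr_db` (e.g. 0 dB instead of 20 dB) produces a harsher condition, which is how the white-noise baseline and the louder industrial noises can be placed on a common scale for comparison.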
