
Speech Emotion Recognition

This example illustrates a simple speech emotion recognition (SER) system using a BiLSTM network. You begin by downloading the data set and then testing the trained network on individual files. The network was trained on a small German-language database [1].

The example then walks you through training the network, which includes downloading, augmenting, and training on the data set. Finally, you perform 10-fold cross validation to evaluate the network architecture.

The features used in this example were chosen using sequential feature selection, similar to the method described in Sequential Feature Selection for Audio Features (Audio Toolbox).

Download Data Set

Download the Berlin Database of Emotional Speech (Emo-DB) [1]. The database contains 535 utterances spoken by 10 actors, each intended to convey one of the following emotions: anger, boredom, disgust, anxiety/fear, happiness, sadness, or neutral. The emotions are text independent.

dataFolder = tempdir;
dataset = fullfile(dataFolder,"Emo-DB");
if ~datasetExists(dataset)
    url = "http://emodb.bilderbar.info/download/download.zip";
    disp("Downloading Emo-DB (40.5 MB) ...")
    unzip(url,dataset)
end
Downloading Emo-DB (40.5 MB) ...

Create an audioDatastore (Audio Toolbox) that points to the audio files.

ads = audioDatastore(fullfile(dataset,"wav"));

The file names are codes indicating the speaker ID, the text spoken, the emotion, and the version. The website contains a key for interpreting the codes, as well as additional information about the speakers such as gender and age. Create a table with the variables Speaker and Emotion. Decode the file names into the table.

filepaths = ads.Files;
emotionCodes = cellfun(@(x)x(end-5),filepaths,UniformOutput=false);
emotions = replace(emotionCodes,["W","L","E","A","F","T","N"], ...
    ["Anger","Boredom","Disgust","Anxiety/Fear","Happiness","Sadness","Neutral"]);

speakerCodes = cellfun(@(x)x(end-10:end-9),filepaths,UniformOutput=false);

labelTable = cell2table([speakerCodes,emotions],VariableNames=["Speaker","Emotion"]);
labelTable.Emotion = categorical(labelTable.Emotion);
labelTable.Speaker = categorical(labelTable.Speaker);
summary(labelTable)
Variables:

    Speaker: 535×1 categorical

        Values:

            03      49
            08      58
            09      43
            10      38
            11      55
            12      35
            13      61
            14      69
            15      56
            16      71

    Emotion: 535×1 categorical

        Values:

            Anger           127
            Anxiety/Fear     69
            Boredom          81
            Disgust          46
            Happiness        71
            Neutral          79
            Sadness          62
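For readers outside MATLAB, the file-name decoding above can be sketched in Python. This is an illustrative sketch only, not part of the example: the `decode_filename` helper and the sample file names are assumptions, while the character positions and the code-to-emotion mapping mirror the MATLAB code above.

```python
# Illustrative sketch: decoding an Emo-DB file name such as "03a01Wa.wav".
# The emotion code is the 6th character from the end; the speaker ID is
# the pair of characters at MATLAB positions end-10:end-9.

EMOTION_CODES = {
    "W": "Anger",         # Wut
    "L": "Boredom",       # Langeweile
    "E": "Disgust",       # Ekel
    "A": "Anxiety/Fear",  # Angst
    "F": "Happiness",     # Freude
    "T": "Sadness",       # Trauer
    "N": "Neutral",
}

def decode_filename(name: str) -> tuple[str, str]:
    """Return (speaker, emotion) decoded from an Emo-DB file name."""
    speaker = name[-11:-9]              # MATLAB end-10:end-9 (1-based)
    emotion = EMOTION_CODES[name[-6]]   # MATLAB end-5 (1-based)
    return speaker, emotion

print(decode_filename("03a01Wa.wav"))  # → ('03', 'Anger')
```

The same positional decoding works on full paths as well, since the indices count from the end of the string.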

labelTable is in the same order as the files in the audioDatastore. Set the Labels property of the audioDatastore to labelTable.

ads.Labels = labelTable;

Perform Speech Emotion Recognition

Download and load the pretrained network, the audioFeatureExtractor (Audio Toolbox) object used to train the network, and the normalization factors for the features. This network was trained using all speakers in the data set except speaker 03.

downloadFolder = matlab.internal.examples.downloadSupportFile("audio","SpeechEmotionRecognition.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
netFolder = fullfile(dataFolder,"SpeechEmotionRecognition");
load(fullfile(netFolder,"network_Audio_SER.mat"));

The sample rate set on the audioFeatureExtractor corresponds to the sample rate of the data set.

fs = afe.SampleRate;

Select a speaker and an emotion, then subset the datastore to include only the chosen speaker and emotion. Read from the datastore and listen to the file.

speaker = categorical("03");
emotion = categorical("Disgust");

adsSubset = subset(ads,ads.Labels.Speaker==speaker & ads.Labels.Emotion==emotion);

audio = read(adsSubset);
sound(audio,fs)

Use the audioFeatureExtractor object to extract the features and then transpose them so that time is along rows. Normalize the features and then convert them to 20-element sequences with 10-element overlap, which corresponds to approximately 600 ms windows with 300 ms overlap. Use the supporting function HelperFeatureVector2Sequence to convert the array of feature vectors to sequences.

features = (extract(afe,audio))';

featuresNormalized = (features - normalizers.Mean)./normalizers.StandardDeviation;

numOverlap = 10;
featureSequences = HelperFeatureVector2Sequence(featuresNormalized,20,numOverlap);

Feed the feature sequences into the network for prediction. Compute the average prediction and plot the probability distribution of the chosen emotions as a pie chart. You can try different speakers, emotions, sequence overlaps, and prediction averages to test the network's performance. To get a realistic approximation of the network's performance, use speaker 03, which the network was not trained on.

YPred = double(predict(net,featureSequences));

average = "mode";
switch average
    case "mean"
        probs = mean(YPred,1);
    case "median"
        probs = median(YPred,1);
    case "mode"
        probs = mode(YPred,1);
end

pie(probs./sum(probs),string(net.Layers(end).Classes))
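The idea of collapsing per-sequence predictions into one file-level decision can be sketched in Python. This sketch is an assumption, not the example's code: the `aggregate` helper is hypothetical, and its "mode" branch approximates MATLAB's column-wise mode of raw scores with a per-sequence majority vote, which matches the intent but not the exact arithmetic.

```python
# Illustrative sketch: combining per-sequence scores into one prediction,
# as the switch statement above does with "mean", "median", or "mode".
import numpy as np

def aggregate(ypred: np.ndarray, how: str = "mode") -> np.ndarray:
    """ypred: (numSequences, numClasses) array of per-sequence scores."""
    if how == "mean":
        return ypred.mean(axis=0)
    if how == "median":
        return np.median(ypred, axis=0)
    if how == "mode":
        # Majority vote over each sequence's argmax class (an approximation
        # of MATLAB's mode(YPred,1) over raw scores).
        votes = np.bincount(ypred.argmax(axis=1), minlength=ypred.shape[1])
        return votes / votes.sum()
    raise ValueError(how)

probs = aggregate(np.array([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]), "mode")
print(probs)  # → [0.6667 0.3333]: two of three sequences vote for class 0
```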

The remainder of the example illustrates how the network was trained and validated.

Train Network

The 10-fold cross-validation accuracy of a first attempt at training was about 60% because of insufficient training data. A model trained on insufficient data overfits some folds and underfits others. To improve the overall fit, increase the size of the data set using audioDataAugmenter (Audio Toolbox). 50 augmentations per file was chosen empirically as a good tradeoff between processing time and accuracy improvement. You can decrease the number of augmentations to speed up the example.

Create an audioDataAugmenter object. Set the probability of applying pitch shifting to 0.5 and use the default range. Set the probability of applying time shifting to 1 and use a range of [-0.3,0.3] seconds. Set the probability of adding noise to 1 and specify the SNR range as [-20,40] dB.

numAugmentations = 50;
augmenter = audioDataAugmenter(NumAugmentations=numAugmentations, ...
    TimeStretchProbability=0, ...
    VolumeControlProbability=0, ...
    PitchShiftProbability=0.5, ...
    TimeShiftProbability=1, ...
    TimeShiftRange=[-0.3,0.3], ...
    AddNoiseProbability=1, ...
    SNRRange=[-20,40]);

Create a new folder in your current folder to hold the augmented data set.

currentDir = pwd;
writeDirectory = fullfile(currentDir,"augmentedData");
mkdir(writeDirectory)

For each file in the audio datastore:

  1. Create 50 augmentations.

  2. Normalize the audio to have a maximum absolute value of 1.

  3. Write the augmented audio data as a WAV file. Append _augK to each file name, where K is the augmentation number. To speed up processing, use parfor and partition the datastore.

This method of augmenting the database is time consuming and space consuming. However, when iterating on the choice of network architecture or feature extraction pipeline, this upfront cost is generally advantageous.

N = numel(ads.Files)*numAugmentations;

reset(ads)

numPartitions = 18;

tic
parfor ii = 1:numPartitions
    adsPart = partition(ads,numPartitions,ii);
    while hasdata(adsPart)
        [x,adsInfo] = read(adsPart);
        data = augment(augmenter,x,fs);
        [~,fn] = fileparts(adsInfo.FileName);
        for i = 1:size(data,1)
            augmentedAudio = data.Audio{i};
            augmentedAudio = augmentedAudio/max(abs(augmentedAudio),[],"all");
            augNum = num2str(i);
            if numel(augNum)==1
                iString = ['0',augNum];
            else
                iString = augNum;
            end
            audiowrite(fullfile(writeDirectory,sprintf('%s_aug%s.wav',fn,iString)),augmentedAudio,fs);
        end
    end
end
disp("Augmentation complete in " + round(toc/60,2) + " minutes.")
Augmentation complete in 3.84 minutes.

Create an audio datastore that points to the augmented data set. Replicate the rows of the label table of the original datastore NumAugmentations times to determine the labels of the augmented datastore.

adsAug = audioDatastore(writeDirectory);
adsAug.Labels = repelem(ads.Labels,augmenter.NumAugmentations,1);

Create an audioFeatureExtractor (Audio Toolbox) object. Set Window to a periodic 30 ms Hamming window, OverlapLength to 0, and SampleRate to the sample rate of the database. Set gtcc, gtccDelta, mfccDelta, and spectralCrest to true to extract them. Set SpectralDescriptorInput to melSpectrum so that spectralCrest is calculated for the mel spectrum.

win = hamming(round(0.03*fs),"periodic");
overlapLength = 0;

afe = audioFeatureExtractor( ...
    Window=win, ...
    OverlapLength=overlapLength, ...
    SampleRate=fs, ...
    gtcc=true, ...
    gtccDelta=true, ...
    mfccDelta=true, ...
    SpectralDescriptorInput="melSpectrum", ...
    spectralCrest=true);

Train for Deployment

When you train for deployment, use all available speakers in the data set. Set the training datastore to the augmented datastore.

adsTrain = adsAug;

Convert the training audio datastore to a tall array. If you have Parallel Computing Toolbox™, the extraction is automatically parallelized. If you do not have Parallel Computing Toolbox™, the code continues to run.

tallTrain = tall(adsTrain);

Extract the training features and reorient them so that time is along rows, for compatibility with sequenceInputLayer.

featuresTallTrain = cellfun(@(x)extract(afe,x),tallTrain,UniformOutput=false);
featuresTallTrain = cellfun(@(x)x',featuresTallTrain,UniformOutput=false);
featuresTrain = gather(featuresTallTrain);
Evaluating tall expression using the Local parallel pool:
- Pass 1 of 1: Completed in 1 min 7 sec

Use the training set to determine the mean and standard deviation of each feature.

allFeatures = cat(2,featuresTrain{:});
M = mean(allFeatures,2,"omitnan");
S = std(allFeatures,0,2,"omitnan");
featuresTrain = cellfun(@(x)(x-M)./S,featuresTrain,UniformOutput=false);

Buffer the feature vectors into sequences so that each sequence consists of 20 feature vectors with an overlap of 10 feature vectors.

featureVectorsPerSequence = 20;
featureVectorOverlap = 10;
[sequencesTrain,sequencePerFileTrain] = HelperFeatureVector2Sequence(featuresTrain,featureVectorsPerSequence,featureVectorOverlap);
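The buffering arithmetic can be checked with a short Python sketch. This mirrors the logic of the HelperFeatureVector2Sequence supporting function but is an illustrative assumption, not a replacement for it: with 20 vectors per sequence and an overlap of 10, the hop is 10, so a file with T feature vectors yields floor((T - 20)/10) + 1 sequences.

```python
# Illustrative sketch: buffering feature vectors into overlapping sequences.
def buffer_sequences(features, per_seq=20, overlap=10):
    """features: list of (numFeatures, T) matrices as nested lists.
    Returns (sequences, counts_per_file)."""
    hop = per_seq - overlap
    sequences, per_file = [], []
    for mat in features:
        T = len(mat[0])                   # number of feature vectors (columns)
        n = (T - per_seq) // hop + 1      # sequences this file produces
        per_file.append(n)
        for k in range(n):
            start = k * hop
            sequences.append([row[start:start + per_seq] for row in mat])
    return sequences, per_file

# A file with 55 feature vectors produces floor((55 - 20)/10) + 1 = 4 sequences.
seqs, counts = buffer_sequences([[list(range(55))]])
print(counts)  # → [4]
```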

Replicate the labels of the training and validation sets so that they are in one-to-one correspondence with the sequences. Not all speakers have utterances for all emotions. Create an empty categorical array that contains all the emotion categories and append it to the validation labels so that the categorical array contains all emotions.

labelsTrain = repelem(adsTrain.Labels.Emotion,[sequencePerFileTrain{:}]);

emptyEmotions = ads.Labels.Emotion;
emptyEmotions(:) = [];

Define a BiLSTM network using bilstmLayer. Place a dropoutLayer before and after the bilstmLayer to help prevent overfitting.

dropoutProb1 = 0.3;
numUnits = 200;
dropoutProb2 = 0.6;
layers = [ ...
    sequenceInputLayer(afe.FeatureVectorLength)
    dropoutLayer(dropoutProb1)
    bilstmLayer(numUnits,OutputMode="last")
    dropoutLayer(dropoutProb2)
    fullyConnectedLayer(numel(categories(emptyEmotions)))
    softmaxLayer
    classificationLayer];

Define training options using trainingOptions.

miniBatchSize = 512;
initialLearnRate = 0.005;
learnRateDropPeriod = 2;
maxEpochs = 3;
options = trainingOptions("adam", ...
    MiniBatchSize=miniBatchSize, ...
    InitialLearnRate=initialLearnRate, ...
    LearnRateDropPeriod=learnRateDropPeriod, ...
    LearnRateSchedule="piecewise", ...
    MaxEpochs=maxEpochs, ...
    Shuffle="every-epoch", ...
    Verbose=false, ...
    Plots="training-progress");

Train the network using trainNetwork.

net = trainNetwork(sequencesTrain,labelsTrain,layers,options);

To save the network, the configured audioFeatureExtractor, and the normalization factors, set saveSERSystem to true.

saveSERSystem = false;
if saveSERSystem
    normalizers.Mean = M;
    normalizers.StandardDeviation = S;
    save("network_Audio_SER.mat","net","afe","normalizers")
end

Train System for Validation

To provide an accurate assessment of the model created in this example, train and validate using leave-one-speaker-out (LOSO) k-fold cross validation. In this method, you train using k - 1 speakers and then validate on the left-out speaker. You repeat this procedure k times. The final validation accuracy is the average of the k folds.
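The fold construction can be sketched in Python. This is an illustrative assumption, not the example's code: the `loso_folds` generator and the toy speaker labels are hypothetical, but the scheme (one fold per unique speaker, validate on the held-out speaker) matches the description above.

```python
# Illustrative sketch: generating leave-one-speaker-out folds. Each fold
# trains on every speaker except one and validates on the left-out speaker.
def loso_folds(speaker_labels):
    """speaker_labels: per-file speaker IDs.
    Yields (held_out_speaker, train_indices, val_indices)."""
    speakers = sorted(set(speaker_labels))
    for held_out in speakers:
        train = [i for i, s in enumerate(speaker_labels) if s != held_out]
        val = [i for i, s in enumerate(speaker_labels) if s == held_out]
        yield held_out, train, val

labels = ["03", "03", "08", "09", "08"]
folds = list(loso_folds(labels))
# 3 unique speakers -> 3 folds; the "03" fold validates on files 0 and 1.
print(len(folds), folds[0])  # → 3 ('03', [2, 3, 4], [0, 1])
```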

Create a variable that contains the speaker IDs. Determine the number of folds: one for each speaker. The database contains utterances from 10 unique speakers. Use summary to display the speaker IDs (left column) and the number of utterances each contributes to the database (right column).

speaker = ads.Labels.Speaker;
numFolds = numel(categories(speaker));
summary(speaker)
     03      49
     08      58
     09      43
     10      38
     11      55
     12      35
     13      61
     14      69
     15      56
     16      71

The supporting function HelperTrainAndValidateNetwork performs the steps outlined above for all 10 folds and returns the true and predicted labels for each fold. Call HelperTrainAndValidateNetwork with the audioDatastore, the augmented audioDatastore, and the audioFeatureExtractor.

[labelsTrue,labelsPred] = HelperTrainAndValidateNetwork(ads,adsAug,afe);

Print the accuracy per fold and plot the 10-fold confusion chart.

for ii = 1:numel(labelsTrue)
    foldAcc = mean(labelsTrue{ii}==labelsPred{ii})*100;
    disp("Fold " + ii + ", Accuracy = " + round(foldAcc,2))
end
Fold 1, Accuracy = 65.31
Fold 2, Accuracy = 68.97
Fold 3, Accuracy = 79.07
Fold 4, Accuracy = 71.05
Fold 5, Accuracy = 72.73
Fold 6, Accuracy = 74.29
Fold 7, Accuracy = 67.21
Fold 8, Accuracy = 85.51
Fold 9, Accuracy = 71.43
Fold 10, Accuracy = 67.61
labelsTrueMat = cat(1,labelsTrue{:});
labelsPredMat = cat(1,labelsPred{:});

figure
cm = confusionchart(labelsTrueMat,labelsPredMat, ...
    Title=["Confusion Matrix for 10-Fold Cross-Validation","Average Accuracy = " + round(mean(labelsTrueMat==labelsPredMat)*100,1)], ...
    ColumnSummary="column-normalized",RowSummary="row-normalized");
sortClasses(cm,categories(emptyEmotions))

Supporting Functions

Convert Array of Feature Vectors to Sequences

function [sequences,sequencePerFile] = HelperFeatureVector2Sequence(features,featureVectorsPerSequence,featureVectorOverlap)
    % Copyright 2019 MathWorks, Inc.
    if featureVectorsPerSequence <= featureVectorOverlap
        error("The number of overlapping feature vectors must be less than the number of feature vectors per sequence.")
    end

    if ~iscell(features)
        features = {features};
    end
    hopLength = featureVectorsPerSequence - featureVectorOverlap;
    idx1 = 1;
    sequences = {};
    sequencePerFile = cell(numel(features),1);
    for ii = 1:numel(features)
        sequencePerFile{ii} = floor((size(features{ii},2) - featureVectorsPerSequence)/hopLength) + 1;
        idx2 = 1;
        for j = 1:sequencePerFile{ii}
            sequences{idx1,1} = features{ii}(:,idx2:idx2 + featureVectorsPerSequence - 1); %#ok<AGROW>
            idx1 = idx1 + 1;
            idx2 = idx2 + hopLength;
        end
    end
end

Train and Validate Network

function [trueLabelsCrossFold,predictedLabelsCrossFold] = HelperTrainAndValidateNetwork(varargin)
    % Copyright 2019 The MathWorks, Inc.
    if nargin == 3
        ads = varargin{1};
        augads = varargin{2};
        extractor = varargin{3};
    elseif nargin == 2
        ads = varargin{1};
        augads = varargin{1};
        extractor = varargin{2};
    end
    speakers = categories(ads.Labels.Speaker);
    numFolds = numel(speakers);
    emptyEmotions = ads.Labels.Emotion;
    emptyEmotions(:) = [];

    % Loop over each fold.
    trueLabelsCrossFold = {};
    predictedLabelsCrossFold = {};
    for i = 1:numFolds

        % 1. Divide the audio datastore into training and validation sets.
        % Convert the data to tall arrays.
        idxTrain = augads.Labels.Speaker~=speakers(i);
        augadsTrain = subset(augads,idxTrain);
        augadsTrain.Labels = augadsTrain.Labels.Emotion;
        tallTrain = tall(augadsTrain);
        idxValidation = ads.Labels.Speaker==speakers(i);
        adsValidation = subset(ads,idxValidation);
        adsValidation.Labels = adsValidation.Labels.Emotion;
        tallValidation = tall(adsValidation);

        % 2. Extract features from the training set. Reorient the features
        % so that time is along rows to be compatible with
        % sequenceInputLayer.
        tallTrain = cellfun(@(x)x/max(abs(x),[],"all"),tallTrain,UniformOutput=false);
        tallFeaturesTrain = cellfun(@(x)extract(extractor,x),tallTrain,UniformOutput=false);
        tallFeaturesTrain = cellfun(@(x)x',tallFeaturesTrain,UniformOutput=false); %#ok<NASGU>
        [~,featuresTrain] = evalc('gather(tallFeaturesTrain)'); % Use evalc to suppress command-line output.
        tallValidation = cellfun(@(x)x/max(abs(x),[],"all"),tallValidation,UniformOutput=false);
        tallFeaturesValidation = cellfun(@(x)extract(extractor,x),tallValidation,UniformOutput=false);
        tallFeaturesValidation = cellfun(@(x)x',tallFeaturesValidation,UniformOutput=false); %#ok<NASGU>
        [~,featuresValidation] = evalc('gather(tallFeaturesValidation)'); % Use evalc to suppress command-line output.

        % 3. Use the training set to determine the mean and standard
        % deviation of each feature. Normalize the training and validation
        % sets.
        allFeatures = cat(2,featuresTrain{:});
        M = mean(allFeatures,2,"omitnan");
        S = std(allFeatures,0,2,"omitnan");
        featuresTrain = cellfun(@(x)(x-M)./S,featuresTrain,UniformOutput=false);
        for ii = 1:numel(featuresTrain)
            idx = find(isnan(featuresTrain{ii}));
            if ~isempty(idx)
                featuresTrain{ii}(idx) = 0;
            end
        end
        featuresValidation = cellfun(@(x)(x-M)./S,featuresValidation,UniformOutput=false);
        for ii = 1:numel(featuresValidation)
            idx = find(isnan(featuresValidation{ii}));
            if ~isempty(idx)
                featuresValidation{ii}(idx) = 0;
            end
        end

        % 4. Buffer the sequences so that each sequence consists of 20
        % feature vectors with overlaps of 10 feature vectors.
        featureVectorsPerSequence = 20;
        featureVectorOverlap = 10;
        [sequencesTrain,sequencePerFileTrain] = HelperFeatureVector2Sequence(featuresTrain,featureVectorsPerSequence,featureVectorOverlap);
        [sequencesValidation,sequencePerFileValidation] = HelperFeatureVector2Sequence(featuresValidation,featureVectorsPerSequence,featureVectorOverlap);

        % 5. Replicate the labels of the train and validation sets so that
        % they are in one-to-one correspondence with the sequences.
        labelsTrain = [emptyEmotions;augadsTrain.Labels];
        labelsTrain = labelsTrain(:);
        labelsTrain = repelem(labelsTrain,[sequencePerFileTrain{:}]);

        % 6. Define a BiLSTM network.
        dropoutProb1 = 0.3;
        numUnits = 200;
        dropoutProb2 = 0.6;
        layers = [ ...
            sequenceInputLayer(size(sequencesTrain{1},1))
            dropoutLayer(dropoutProb1)
            bilstmLayer(numUnits,OutputMode="last")
            dropoutLayer(dropoutProb2)
            fullyConnectedLayer(numel(categories(emptyEmotions)))
            softmaxLayer
            classificationLayer];

        % 7. Define training options.
        miniBatchSize = 512;
        initialLearnRate = 0.005;
        learnRateDropPeriod = 2;
        maxEpochs = 3;
        options = trainingOptions("adam", ...
            MiniBatchSize=miniBatchSize, ...
            InitialLearnRate=initialLearnRate, ...
            LearnRateDropPeriod=learnRateDropPeriod, ...
            LearnRateSchedule="piecewise", ...
            MaxEpochs=maxEpochs, ...
            Shuffle="every-epoch", ...
            Verbose=false);

        % 8. Train the network.
        net = trainNetwork(sequencesTrain,labelsTrain,layers,options);

        % 9. Evaluate the network. Call classify to get the predicted labels
        % for each sequence. Take the mode of the predicted labels of each
        % sequence to get the predicted labels of each file.
        predictedLabelsPerSequence = classify(net,sequencesValidation);
        trueLabels = categorical(adsValidation.Labels);
        predictedLabels = trueLabels;
        idx1 = 1;
        for ii = 1:numel(trueLabels)
            predictedLabels(ii,:) = mode(predictedLabelsPerSequence(idx1:idx1 + sequencePerFileValidation{ii} - 1,:),1);
            idx1 = idx1 + sequencePerFileValidation{ii};
        end
        trueLabelsCrossFold{i} = trueLabels; %#ok<AGROW>
        predictedLabelsCrossFold{i} = predictedLabels; %#ok<AGROW>
    end
end

References

[1] Burkhardt, F., A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss. "A Database of German Emotional Speech." In Proceedings of Interspeech 2005. Lisbon, Portugal: International Speech Communication Association, 2005.
