What YouTube Is Talking About

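At the dawn of machine learning, most solutions looked very strange, isolated, and unusual. Today many ML algorithms are packaged into frameworks and toolkits familiar to programmers, and you can use them without knowing their implementation details.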

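By the way, I'm against such a superficial approach, but I want to show my colleagues that the industry is advancing by leaps and bounds, and that applying its achievements in production projects isn't complicated.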

As an example, I'll show how we can help users of our workflow service find the right video among hundreds of materials.

In my project, users create and share hundreds of different materials: text in various formats, pictures, videos, articles, documents.

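Searching documents seems straightforward. But what about multimedia content? For a full-fledged user service you have to fill in a description and give the video or picture a title, and a few tags wouldn't hurt. Unfortunately, not everyone wants to spend time on this kind of content improvement. Typically a user uploads a link to YouTube, says it's a new video, and clicks save. What can the service do with such "gray" content? The first idea is to ask YouTube. But YouTube is also filled in by users (often the same user). And the video may not come from YouTube at all.
So I came up with the idea of teaching our service to "listen" to the video and "understand" what it is about.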

I admit the idea is nothing new, but today implementing it doesn't require a staff of ten data scientists, just two days and modest hardware resources.

Problem statement

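Our microservice, called Summarizer, should:

download the video from the media service;
extract the audio track;
listen to the audio, that is, perform speech-to-text;
find 20 keywords;
pick the one sentence from the text that best captures the essence of the video;
send all the results to the content service.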

We'll entrust the implementation to Python, so we won't have to fuss with integrating ready-made ML solutions.

Step one: audio to text

First, install all the required components.

pip3 install wave numpy tensorflow youtube_dl ffmpeg-python deepspeech nltk networkx
brew install ffmpeg wget

Next, download and unpack the pre-trained speech-to-text model from Mozilla DeepSpeech.

 mkdir /Users/Volodymyr/Projects/deepspeech/
 cd /Users/Volodymyr/Projects/deepspeech/
 wget https://github.com/mozilla/DeepSpeech/releases/download/v0.3.0/deepspeech-0.3.0-models.tar.gz
 tar zxvf deepspeech-0.3.0-models.tar.gz

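The Mozilla team has created and trained a fairly good solution that uses TensorFlow to turn long audio clips into large chunks of text with good quality. TensorFlow also lets you run it on both CPU and GPU out of the box.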

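Our code starts by downloading the content. The excellent youtube-dl library helps here; it has a built-in post-processor that can convert the video to the required format. Unfortunately, the post-processor code is somewhat limited and doesn't know how to resample, so we'll help it.
As input, Deepspeech needs a mono audio file sampled at 16 kHz. To get that, we have to reprocess the downloaded file.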

 _ = ffmpeg.input(youtube_id + '.wav').output(output_file_name, ac=1, t=crop_time, ar='16k').overwrite_output().run(capture_stdout=False)

In the same operation we can also limit the duration of the file by passing the additional parameter "t".
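Before handing the converted file to the model, it can be worth checking that it really is mono, 16 kHz, 16-bit audio. The sketch below is not part of the original service; the function name check_deepspeech_input is hypothetical, and it relies only on the standard-library wave module.

```python
import wave


def check_deepspeech_input(path, expected_rate=16000):
    """Return a list of problems found in the WAV file (empty if it looks fine).

    Hypothetical helper: it only inspects the WAV header, not the audio itself.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(
                "expected mono, got {} channels".format(wav.getnchannels())
            )
        if wav.getframerate() != expected_rate:
            problems.append(
                "expected {} Hz, got {} Hz".format(expected_rate, wav.getframerate())
            )
        if wav.getsampwidth() != 2:
            problems.append(
                "expected 16-bit samples, got {}-bit".format(8 * wav.getsampwidth())
            )
    return problems
```

An empty list means the file matches what the text above says Deepspeech expects; anything else is a hint that the ffmpeg step needs another look.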

Load the deepspeech model.

 deepspeech = Model(args.model, N_FEATURES, N_CONTEXT, args.alphabet, BEAM_WIDTH)

Using the wave library, we extract the frames as an np.array and pass them to the Deepspeech library.

 fin = wave.open(file_name, 'rb')
 framerate_sample = fin.getframerate()
 if framerate_sample != 16000:
     print('Warning: original sample rate ({}) is different than 16kHz. Resampling might produce erratic speech recognition.'.format(framerate_sample), file=sys.stderr)
     fin.close()
     return
 else:
     audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

 audio_length = fin.getnframes() * (1/16000)
 fin.close()

 print('Running inference.', file=sys.stderr)
 inference_start = timer()
 result = deepspeech.stt(audio, framerate_sample)

After some time, which depends on your hardware resources, you'll get the text.

Step two: finding the "meaning"

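To find the words that describe the resulting text, I'll use a graph-based method. The method is based on the distance between words in the sequence. First we find all the "unique" words; these will be the vertices of our graph. Then we slide a "window" of a given length over the text and find the distances between words; these become the edges of the graph. The vertices added to the graph can be restricted by a syntactic filter that selects only lexical units of a certain part of speech. For example, you might add only nouns and verbs to the graph. In that case, potential edges are built only from relations that can be established between nouns and verbs.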
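The graph construction described above can be sketched in plain Python. This is an illustration under assumptions, not the code from the repository: build_cooccurrence_graph and its keep predicate are hypothetical names, the predicate stands in for a real part-of-speech filter, and filtering is applied before the window slides over the text.

```python
from collections import defaultdict


def build_cooccurrence_graph(words, window=4, keep=None):
    """Link every pair of distinct words that occur within `window` tokens.

    `keep` is an optional predicate standing in for the syntactic filter
    (for example "nouns and verbs only"); it is an assumption of this sketch.
    """
    if keep is not None:
        words = [w for w in words if keep(w)]
    graph = defaultdict(set)
    for i, word in enumerate(words):
        # Look at the next window - 1 tokens after the current one.
        for other in words[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


tokens = "the cat sat on the mat".split()
graph = build_cooccurrence_graph(tokens, window=2)
```

With window=2 only adjacent words are linked, so "the" ends up connected to "cat", "on" and "mat". The edges here are unweighted and undirected, which is the simplest variant of the idea.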

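The score associated with each vertex is set to an initial value of 1, and the ranking algorithm is started. The ranking algorithm works like "voting" or "recommendation". When one vertex links to another, it casts a "vote" for that vertex. The more votes a vertex receives, the more important it is. Moreover, the importance of the voting vertex determines how much its vote counts, and the ranking model takes this into account too. So the score of a vertex is determined by the votes cast for it and by the scores of the vertices that cast them.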

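Suppose we have a graph G = (V, E), described by a set of vertices V and a set of edges E. For each vertex Vi, let In(Vi) be the set of vertices that point to it and Out(Vi) the set of vertices it points to. The score of vertex Vi can then be written as: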

$S(V_i) = (1 - d) + d \cdot \sum_{j \in \mathit{In}(V_i)} \frac{1}{|\mathit{Out}(V_j)|} S(V_j)$

where d is the damping factor, which takes a value between 0 and 1.

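The algorithm has to iterate over the graph several times to obtain approximate scores.
Once every vertex has an approximate score, the vertices are sorted in descending order. The vertices at the top of the list are the keywords we are looking for.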
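As a minimal sketch of the iteration, the scoring formula above can be applied to an undirected graph, where In and Out of a vertex are both just its neighbour set. The function name rank_vertices, the damping value 0.85 and the fixed iteration count are assumptions made for illustration; the original project may well rely on a library implementation instead.

```python
def rank_vertices(graph, d=0.85, iterations=30):
    """Iteratively score vertices; `graph` maps a vertex to its neighbour set.

    Assumes an undirected graph: every neighbour is itself a key of `graph`
    and lists the vertex back, so no neighbour set used below is empty.
    """
    scores = {v: 1.0 for v in graph}  # every vertex starts with score 1
    for _ in range(iterations):
        new_scores = {}
        for v in graph:
            # Each neighbour u splits its current score evenly among its links.
            votes = sum(scores[u] / len(graph[u]) for u in graph[v])
            new_scores[v] = (1 - d) + d * votes
        scores = new_scores
    return scores


# A tiny star-shaped graph: "a" is linked to everything else.
star = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
scores = rank_vertices(star)
keywords = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the hub "a" collects the votes of all other vertices and therefore ends up first in keywords. A fixed number of iterations keeps the sketch short; stopping once the scores change by less than some tolerance would be the more usual choice.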

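The most relevant sentence in the text is found by averaging the scores of all the words in the sentence: we add up all the scores and divide by the number of words in the sentence.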
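The averaging step might look like the following sketch. The name best_sentence is hypothetical, splitting on whitespace is a simplification, and treating words without a score as 0 is an assumption of this example rather than something the article specifies.

```python
def best_sentence(sentences, scores):
    """Return the sentence whose words have the highest mean score."""

    def mean_score(sentence):
        words = sentence.lower().split()
        if not words:
            return 0.0
        # Words that never made it into the graph count as 0 (an assumption).
        return sum(scores.get(w, 0.0) for w in words) / len(words)

    return max(sentences, key=mean_score)


word_scores = {"graph": 2.0, "rank": 1.5, "the": 0.1}
candidates = ["the graph", "graph rank", "the the the"]
summary = best_sentence(candidates, word_scores)
```

Dividing by the word count keeps long sentences from winning simply because they contain more words.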

 iMac:YoutubeSummarizer $ cd /Users/Volodymyr/Projects/YoutubeSummarizer ; env "PYTHONIOENCODING=UTF-8" "PYTHONUNBUFFERED=1" /usr/local/bin/python3 /Users/Volodymyr/.vscode/extensions/ms-python.python-2018.11.0/pythonFiles/experimental/ptvsd_launcher.py --default --client --host localhost --port 53730 /Users/Volodymyr/Projects/YoutubeSummarizer/summarizer.py --youtube-id yA-FCxFQNHg --model /Users/Volodymyr/Projects/deepspeech/models/output_graph.pb --alphabet /Users/Volodymyr/Projects/deepspeech/models/alphabet.txt --lm /Users/Volodymyr/Projects/deepspeech/models/lm.binary --trie /Users/Volodymyr/Projects/deepspeech/models/trie --crop-time 900
 Done downloading, now converting ...
 ffmpeg version 4.1 Copyright (c) 2000-2018 the FFmpeg developers
 built with Apple LLVM version 10.0.0 (clang-1000.11.45.5)
 configuration: --prefix=/usr/local/Cellar/ffmpeg/4.1 --enable-shared --enable-pthreads --enable-version3 --enable-hardcoded-tables --enable-avresample --cc=clang --host-cflags= --host-ldflags= --enable-ffplay --enable-gpl --enable-libmp3lame --enable-libopus --enable-libsnappy --enable-libtheora --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-libxvid --enable-lzma --enable-opencl --enable-videotoolbox
 libavutil 56. 22.100 / 56. 22.100
 libavcodec 58. 35.100 / 58. 35.100
 libavformat 58. 20.100 / 58. 20.100
 libavdevice 58. 5.100 / 58. 5.100
 libavfilter 7. 40.101 / 7. 40.101
 libavresample 4. 0. 0 / 4. 0. 0
 libswscale 5. 3.100 / 5. 3.100
 libswresample 3. 3.100 / 3. 3.100
 libpostproc 55. 3.100 / 55. 3.100
 Guessed Channel Layout for Input Stream #0.0 : stereo
 Input #0, wav, from 'yA-FCxFQNHg.wav':
 Metadata:
 encoder : Lavf58.20.100
 Duration: 00:17:27.06, bitrate: 1536 kb/s
 Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, stereo, s16, 1536 kb/s
 Stream mapping:
 Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
 Press [q] to stop, [?] for help
 Output #0, wav, to 'result-yA-FCxFQNHg.wav':
 Metadata:
 ISFT : Lavf58.20.100
 Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
 Metadata:
 encoder : Lavc58.35.100 pcm_s16le
 size= 28125kB time=00:15:00.00 bitrate= 256.0kbits/s speed=1.02e+03x
 video:0kB audio:28125kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000271%
 Loading model from file /Users/Volodymyr/Projects/deepspeech/models/output_graph.pb
 TensorFlow: v1.11.0-9-g97d851f04e
 DeepSpeech: unknown
 Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
 2018-12-14 17:42:03.121170: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
 Loaded model in 0.5s.
 Loading language model from files /Users/Volodymyr/Projects/deepspeech/models/lm.binary /Users/Volodymyr/Projects/deepspeech/models/trie
 Loaded language model in 3.17s.
 Running inference.
 Building top 20 keywords...
 {'communicate', 'government', 'repetition', 'terrorism', 'technology', 'thinteeneighty', 'incentive', 'ponsibility', 'experience', 'upsetting', 'democracy', 'infection', 'difference', 'evidesrisia', 'legislature', 'metriamatrei', 'believing', 'administration', 'antagethetruth', 'information', 'conspiracy'}
 Building summary sentence...
 intellectually antagethetruth administration thinteeneighty understanding metriamatrei shareholders evidesrisia recognizing ponsibility communicate information legislature abaddoryis technology difference conspiracy repetition experience government protecting categories mankyuses democracy campaigns primarily attackers terrorism believing happening infection seriously incentive upsetting testified fortunate questions president companies prominent actually platform massacre powerful building poblanas thinking supposed accounts murdered function unsolved perverse recently fighting opposite motional election children watching traction speaking measured nineteen repeated coverage imagined positive designed together countess greatest fourteen attacks publish brought through explain russian opinion winking somehow welcome trithis problem looking college gaining feoryhe talking ighting believe happens connect further working ational mistake diverse between ferring
 Inference took 76.729s for 900.000s audio file.

Summary

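Compared with three or four years ago, an idea that comes to mind today can be implemented much faster. Try things, experiment! I believe artificial intelligence is first and foremost the intelligence of the engineers who work with it.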

The code is available in my GitHub repository.
