文本向量化流程

文本向量化流程¶

本示例说明如何使用 Dask-ML 并行分类大型文本数据集。它改编自此 scikit-learn 示例。

主要区别在于

我们将整个模型（包括文本向量化）作为流水线进行拟合。
我们使用 Dask Bag、Dask Dataframe 和 Dask Array 等 Dask 集合，而不是生成器来处理大于内存的数据集。

[1]:

from dask.distributed import Client, progress

client = Client(n_workers=2, threads_per_worker=2, memory_limit='2GB')
client

[1]:

客户端

客户端-94cfa4ae-0de1-11ed-a521-000d3a8f7959

连接方法： Cluster object	集群类型： distributed.LocalCluster
仪表盘： http://127.0.0.1:8787/status

集群信息

本地集群

0f48ff5b

仪表盘： http://127.0.0.1:8787/status	工作进程 2
总线程数 4	总内存： 3.73 GiB
状态： running	使用进程： True

调度器信息

调度器

调度器-df50e6c2-504c-458c-904f-d9354f469039

通信： tcp://127.0.0.1:35661	工作进程 2
仪表盘： http://127.0.0.1:8787/status	总线程数 4
启动于： Just now	总内存： 3.73 GiB

工作进程

工作进程：0

通信： tcp://127.0.0.1:46267	总线程数 2
仪表盘： http://127.0.0.1:41243/status	内存： 1.86 GiB
Nanny： tcp://127.0.0.1:38595
本地目录： /home/runner/work/dask-examples/dask-examples/machine-learning/dask-worker-space/worker-imhji18u

工作进程：1

通信： tcp://127.0.0.1:44477	总线程数 2
仪表盘： http://127.0.0.1:42527/status	内存： 1.86 GiB
Nanny： tcp://127.0.0.1:34731
本地目录： /home/runner/work/dask-examples/dask-examples/machine-learning/dask-worker-space/worker-r6iwv690

获取数据¶

Scikit-Learn 提供了一个用于获取新闻组数据集的工具。

[2]:

import sklearn.datasets

bunch = sklearn.datasets.fetch_20newsgroups()

scikit-learn 中的数据不是太大，因此数据直接在内存中返回。每个文档都是一个字符串。我们预测的目标是一个整数，它表示帖子的主题。

我们将直接把文档和目标加载到 dask DataFrame 中。在实践中，对于大于内存的数据集，您可能会使用 dask.bag 或 dask.delayed 从磁盘或云存储加载文档。

[3]:

import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(pd.DataFrame({"text": bunch.data, "target": bunch.target}),
                    npartitions=25)

df

[3]:

Dask DataFrame 结构

	text	target
npartitions=25
0	object	int64
453	...	...
...	...	...
10872	...	...
11313	...	...

Dask 名称: from_pandas, 25 tasks

text 列中的每一行都包含一些元数据和帖子的完整文本。

[4]:

print(df.head().loc[0, 'text'][:500])

From: [email protected] (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a m

特征哈希¶

Dask 的 HashingVectorizer 提供了与 scikit-learn 实现类似的 API。事实上，Dask-ML 的实现使用了 scikit-learn 的功能，并将其应用于输入的 dask.dataframe.Series 或 dask.bag.Bag 的每个分区。

转换（一旦我们实际计算结果）是并行进行的，并返回一个 dask Array。

[5]:

import dask_ml.feature_extraction.text

vect = dask_ml.feature_extraction.text.HashingVectorizer()
X = vect.fit_transform(df['text'])
X

[5]:

	数组	块
形状	(nan, 1048576)	(nan, 1048576)
数量	75 任务	25 块
类型	float64	scipy.sparse._csr.csr_matrix

输出数组 X 的块大小未知，因为输入的 dask Series 或 Bags 不知道自己的长度。

X 中的每个块都是一个 scipy.sparse 矩阵。

[6]:

X.blocks[0].compute()

[6]:

<453x1048576 sparse matrix of type '<class 'numpy.float64'>'
        with 64357 stored elements in Compressed Sparse Row format>

这是一个文档-词项矩阵。每一行是原始帖子的哈希表示。

分类流程¶

我们可以将 HashingVectorizer 与 Incremental 以及像 scikit-learn 的 SGDClassifier 这样的分类器结合起来，创建一个分类流程。

我们将预测该主题是否属于 comp 类别。

[7]:

bunch.target_names

[7]:

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

[8]:

import numpy as np

positive = np.arange(len(bunch.target_names))[['comp' in x for x in bunch.target_names]]
y = df['target'].isin(positive).astype(int)
y

[8]:

Dask Series Structure:
npartitions=25
0        int64
453        ...
         ...
10872      ...
11313      ...
Name: target, dtype: int64
Dask Name: astype, 101 tasks

[9]:

import numpy as np
import sklearn.linear_model
import sklearn.pipeline

import dask_ml.wrappers

因为输入来自 dask Series，其块大小未知，我们需要指定 assume_equal_chunks=True。这告诉 Dask-ML，我们知道 X 中的每个分区都与 y 中的一个分区匹配。

[10]:

sgd = sklearn.linear_model.SGDClassifier(
    tol=1e-3
)
clf = dask_ml.wrappers.Incremental(
    sgd, scoring='accuracy', assume_equal_chunks=True
)
pipe = sklearn.pipeline.make_pipeline(vect, clf)

SGDClassifier.partial_fit 需要提前知道完整的类别集合。因为我们的 sgd 被包装在 Incremental 内部，我们需要将其作为 fit 中的 incremental__classes 关键字参数传递。

[11]:

pipe.fit(df['text'], y,
         incremental__classes=[0, 1]);

像往常一样，Incremental.predict 惰性地将预测结果作为 dask Array 返回。

[12]:

predictions = pipe.predict(df['text'])
predictions

[12]:

	数组	块
字节	未知	未知
形状	(nan,)	(nan,)
数量	100 任务	25 块
类型	int64	numpy.ndarray

我们可以使用 dask_ml.metrics.accuracy_score 并行计算预测和得分。

[13]:

dask_ml.metrics.accuracy_score(y, predictions)

[13]:

0.950150256319604

HashingVectorizer 和 SGDClassifier 的这种简单组合在此预测任务上非常有效。

增量训练大数据集

使用 Dask 进行超参数优化

Dask 示例文档