Machine learning
MLlib is Spark’s library of machine learning functions. Designed to run in parallel on clusters, MLlib contains a variety of learning algorithms and is accessible from all of Spark’s programming languages.
Warning: As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary machine learning API for Spark is now the DataFrame-based API in the spark.ml package.
MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces.
A local vector has integer-typed, 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, a vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector.
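The relationship between the two representations can be sketched in plain Python. This is only an illustration of the data model, not MLlib's implementation; the helper names dense_to_sparse and sparse_to_dense are hypothetical and do not exist in MLlib.

```python
from typing import List, Tuple

# Hypothetical alias for illustration: (size, indices, values).
SparseTriple = Tuple[int, List[int], List[float]]


def dense_to_sparse(values: List[float]) -> SparseTriple:
    """Keep only the nonzero entries as two parallel arrays plus the size."""
    indices = [i for i, v in enumerate(values) if v != 0.0]
    nonzeros = [values[i] for i in indices]
    return len(values), indices, nonzeros


def sparse_to_dense(size: int, indices: List[int], values: List[float]) -> List[float]:
    """Expand the parallel arrays back into a full list of doubles."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


dense = [1.0, 0.0, 3.0]
sparse = dense_to_sparse(dense)
print(sparse)                    # (3, [0, 2], [1.0, 3.0])
print(sparse_to_dense(*sparse))  # [1.0, 0.0, 3.0]
```

The sparse form only pays off when most entries are zero, since it stores an index alongside every value it keeps.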
MLlib recognizes the following types as dense vectors:
NumPy’s array
Python’s list, e.g., [1, 2, 3]
and the following as sparse vectors:
MLlib’s SparseVector
SciPy’s csc_matrix with a single column
Info: We recommend using NumPy arrays over lists for efficiency, and using the factory methods implemented in Vectors to create sparse vectors.
A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ...
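The label conventions above can be sketched with a small plain-Python stand-in. The class LabeledExample and the two helper functions are hypothetical names chosen for illustration; they are not MLlib's LabeledPoint, and the range check on class indices is an assumption of this sketch rather than documented MLlib behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LabeledExample:
    """Hypothetical stand-in: a double label paired with a feature vector."""
    label: float
    features: List[float]


def binary_example(is_positive: bool, features: List[float]) -> LabeledExample:
    # Binary classification: 1 for positive, 0 for negative, stored as a double.
    return LabeledExample(1.0 if is_positive else 0.0, list(features))


def multiclass_example(class_index: int, num_classes: int,
                       features: List[float]) -> LabeledExample:
    # Multiclass: labels are class indices starting from zero (0, 1, 2, ...).
    if not 0 <= class_index < num_classes:
        raise ValueError(f"class index {class_index} outside 0..{num_classes - 1}")
    return LabeledExample(float(class_index), list(features))


print(binary_example(True, [1.0, 0.0, 3.0]))
print(multiclass_example(2, 3, [0.5, 0.0]))
```

Storing the label as a double is what lets the same structure serve regression targets and class indices alike.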
It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format:
label index1:value1 index2:value2 ...
where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.
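To make the index conversion concrete, here is a minimal plain-Python sketch of parsing one line, assuming the usual LIBSVM layout of a label followed by index:value pairs separated by whitespace. The function parse_libsvm_line is a hypothetical helper, not MLlib's loader, and rejecting repeated indices is an assumption of this sketch.

```python
from typing import List, Tuple


def parse_libsvm_line(line: str) -> Tuple[float, List[int], List[float]]:
    """Parse 'label index1:value1 index2:value2 ...' into
    (label, zero_based_indices, values)."""
    tokens = line.split()
    label = float(tokens[0])
    indices: List[int] = []
    values: List[float] = []
    previous = 0  # one-based indices must start at 1 or higher
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        if index <= previous:
            raise ValueError("indices must be one-based and ascending")
        previous = index
        indices.append(index - 1)  # convert one-based to zero-based
        values.append(float(value_text))
    return label, indices, values


print(parse_libsvm_line("1 1:1.0 3:3.0"))  # (1.0, [0, 2], [1.0, 3.0])
```

The parsed indices and values line up with the two parallel arrays of a sparse vector; the vector size is not stored in the line itself and has to come from elsewhere, for example the largest index seen across the file.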