
Statistics


Last updated 7 years ago

Basic statistics are an important part of data analysis, both in ad hoc exploration and in understanding data for machine learning. MLlib offers several widely used statistical functions that work directly on RDDs, through methods in the mllib.stat.Statistics class. Some commonly used ones include:

Statistics.colStats(rdd)

Computes a statistical summary of an RDD of vectors. The summary stores the min, max, mean, and variance of each column in the set of vectors, so a wide variety of statistics can be obtained in one pass over the data.

Statistics.corr(rdd, method)

Computes the correlation matrix between the columns of an RDD of vectors, using either the Pearson or the Spearman correlation (method must be either pearson or spearman).

Statistics.corr(rdd1, rdd2, method)

Computes the correlation between two RDDs of floating-point values, using either the Pearson or the Spearman correlation (method must be either pearson or spearman).

Statistics.chiSqTest(rdd)

Computes Pearson’s chi-squared independence test of every feature against the label, on an RDD of LabeledPoint objects. Returns an array of ChiSqTestResult objects that capture the p-value, test statistic, and degrees of freedom for each feature. Label and feature values must be categorical (i.e., discrete values).

Apart from these methods, RDDs containing numeric data offer several basic statistics such as mean(), stdev(), and sum(). In addition, RDDs support sample() for building simple random samples, and pair RDDs support sampleByKey() for building stratified samples.

Jupyter notebook for stats
Solution: Jupyter notebook for stats