Ophelian Spark ML Unsupervised

PCAnalysis

PCAnalysis is used for Principal Component Analysis (PCA) on Spark DataFrames. It reduces the dimensionality of the data by projecting the feature vectors onto a smaller set of orthogonal principal components.

Parameters:

  • k: Number of principal components to keep.
  • metadata_path: Path where the fitted PCA model metadata is saved.

Example:

from ophelia_spark.ml.unsupervised.feature import PCAnalysis
from pyspark.ml.linalg import DenseVector

df = spark.createDataFrame(
    [(DenseVector([0.0, 1.0]),), (DenseVector([1.0, 0.0]),), (DenseVector([0.2, 0.1]),)],
    ["scaled_features"]
)
pca = PCAnalysis(k=2, metadata_path='/path/to/save/pca_model')
result = pca.transform(df)
result.show()
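
The example above assumes the input column already holds scaled feature vectors. A minimal sketch of producing such a scaled_features column with PySpark's own VectorAssembler and StandardScaler (standard pyspark.ml APIs, shown as a suggested preprocessing step rather than part of PCAnalysis itself):

from pyspark.ml.feature import VectorAssembler, StandardScaler

raw = spark.createDataFrame([(0.0, 1.0), (1.0, 0.0), (0.2, 0.1)], ["x1", "x2"])

# Assemble the numeric columns into a single vector column.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
assembled = assembler.transform(raw)

# Standardize each feature to zero mean and unit variance before PCA.
scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
scaled = scaler.fit(assembled).transform(assembled)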

SingularVD

SingularVD is used for Singular Value Decomposition (SVD) on Spark DataFrames. It factorizes the feature matrix into three matrices (U, S, and V^T) whose leading singular values and vectors capture the essential structure of the data.

Parameters:

  • k: Number of singular values and vectors to compute.
  • offset: Threshold for the cumulative explained variance used to decide how many components to retain (illustrated after the example below).
  • label_col: Name of the label column.

Example:

from ophelia_spark.ml.unsupervised.feature import SingularVD
from pyspark.ml.linalg import DenseVector

df = spark.createDataFrame(
    [(DenseVector([0.0, 1.0]),), (DenseVector([1.0, 0.0]),), (DenseVector([0.2, 0.1]),)],
    ["features"]
)
svd = SingularVD(k=2, offset=95, label_col='label')
result = svd.transform(df)
result.show()
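
The offset parameter is a cumulative-variance cutoff. The sketch below illustrates how such a cutoff relates to singular values, using pyspark.mllib's RowMatrix.computeSVD directly; it is a conceptual illustration, not necessarily what SingularVD does internally:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

# Build a distributed row matrix from the feature vectors.
rows = df.rdd.map(lambda row: Vectors.dense(row["features"].toArray()))
mat = RowMatrix(rows)
svd = mat.computeSVD(2, computeU=True)

# Squared singular values are proportional to the variance each component explains;
# an offset of 95 would keep the smallest k whose cumulative share reaches 95%.
s = svd.s.toArray()
explained = (s ** 2) / (s ** 2).sum()
print(explained.cumsum() * 100)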

IndependentComponent

IndependentComponent is used for Independent Component Analysis (ICA) on Spark DataFrames. It separates a multivariate signal into additive, independent components.

Parameters:

  • n_components: Number of independent components.

Example:

from ophelia_spark.ml.unsupervised.feature import IndependentComponent
from pyspark.ml.linalg import DenseVector

df = spark.createDataFrame(
    [(DenseVector([0.0, 1.0]),), (DenseVector([1.0, 0.0]),), (DenseVector([0.2, 0.1]),)],
    ["features"]
)
ica = IndependentComponent(n_components=2)
result = ica.transform(df)
result.show()
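
For intuition about what ICA recovers, here is a small single-machine sketch of the same technique using scikit-learn's FastICA; scikit-learn is used only for illustration and is not part of ophelia_spark:

import numpy as np
from sklearn.decomposition import FastICA

# Two independent source signals mixed into two observed channels.
t = np.linspace(0, 8, 200)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
mixed = sources @ np.array([[1.0, 0.5], [0.5, 1.0]])

# Recover estimates of the original sources from the mixture.
ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(mixed)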

LinearDAnalysis

LinearDAnalysis is used for Linear Discriminant Analysis (LDA) on Spark DataFrames. It finds a linear combination of features that characterizes or separates two or more classes.

Parameters:

  • n_components: Number of discriminant components to keep (at most the number of classes minus one).

Example:

from ophelia_spark.ml.unsupervised.feature import LinearDAnalysis
from pyspark.ml.linalg import DenseVector

df = spark.createDataFrame(
    [(DenseVector([0.0, 1.0]), 1), (DenseVector([1.0, 0.0]), 0), (DenseVector([0.2, 0.1]), 1)],
    ["features", "label"]
)
lda = LinearDAnalysis(n_components=1)
result = lda.transform(df)
result.show()
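
As a single-machine reference for what the projection computes, the equivalent step with scikit-learn's LinearDiscriminantAnalysis looks like this (illustration only; with two classes there is at most one discriminant component):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.2], [0.2, 0.1]])
y = np.array([1, 1, 0, 0, 1])

# Project onto the single axis that best separates the two classes.
lda = LinearDiscriminantAnalysis(n_components=1)
projected = lda.fit_transform(X, y)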

LLinearEmbedding

LLinearEmbedding is used for Locally Linear Embedding (LLE) on Spark DataFrames. It reduces the dimensionality of data while preserving the relationships between neighboring data points.

Parameters:

  • n_neighbors: Number of neighbors to consider for each point.
  • n_components: Number of dimensions for the embedded space.

Example:

from ophelia_spark.ml.unsupervised.feature import LLinearEmbedding
from pyspark.ml.linalg import DenseVector

df = spark.createDataFrame(
    [(DenseVector([0.0, 1.0]),), (DenseVector([1.0, 0.0]),), (DenseVector([0.2, 0.1]),)],
    ["features"]
)
lle = LLinearEmbedding(n_neighbors=5, n_components=2)
result = lle.transform(df)
result.show()
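
LLE reconstructs each point from its n_neighbors nearest neighbors, so it needs more rows than n_neighbors; the three-row toy DataFrame above is only a placeholder. For intuition, a single-machine sketch of the technique on a classic manifold dataset using scikit-learn (illustration only, not the library's internals):

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Unroll a 3-D swiss roll into 2 dimensions while preserving local neighborhoods.
X, _ = make_swiss_roll(n_samples=300, random_state=0)
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
embedding = lle.fit_transform(X)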

StochasticNeighbor

StochasticNeighbor is used for t-distributed Stochastic Neighbor Embedding (t-SNE) on Spark DataFrames. It is a nonlinear dimensionality reduction technique that is well-suited for embedding high-dimensional data for visualization in a low-dimensional space.

Parameters:

  • n_components: Number of dimensions for the embedded space.
  • perplexity: Effective number of neighbors considered for each point.
  • learning_rate: Step size for the gradient-based optimization.
  • n_iter: Number of optimization iterations.

Example:

from ophelia_spark.ml.unsupervised.feature import StochasticNeighbor
from pyspark.ml.linalg import DenseVector

df = spark.createDataFrame(
    [(DenseVector([0.0, 1.0]),), (DenseVector([1.0, 0.0]),), (DenseVector([0.2, 0.1]),)],
    ["features"]
)
tsne = StochasticNeighbor(n_components=2, perplexity=30.0, learning_rate=200.0, n_iter=1000)
result = tsne.transform(df)
result.show()
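
For intuition about the parameters, here is a single-machine sketch with scikit-learn's TSNE (illustration only). Perplexity is conventionally expected to be smaller than the number of rows, so a larger dataset than the three-row example above is used; n_iter is omitted because recent scikit-learn versions rename it to max_iter:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Embed the 64-dimensional digits data into 2 dimensions for visualization.
X, y = load_digits(return_X_y=True)
tsne = TSNE(n_components=2, perplexity=30.0, learning_rate=200.0, random_state=0)
embedding = tsne.fit_transform(X)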