Ophelian Spark ML Unsupervised
PCAnalysis
PCAnalysis performs Principal Component Analysis (PCA) on Spark DataFrames. It reduces the dimensionality of the data by projecting the features onto a set of k principal components.
Parameters:
- k: Number of principal components.
- metadata_path: Path where the model metadata is saved.
Example:
from ophelia_spark.ml.unsupervised.feature import PCAnalysis
from ophelia_spark import DenseVector
df = spark.createDataFrame(
[(DenseVector([0.0, 1.0]),), (DenseVector([1.0, 0.0]),), (DenseVector([0.2, 0.1]),)],
["scaled_features"]
)
pca = PCAnalysis(k=2, metadata_path='/path/to/save/pca_model')
result = pca.transform(df)
result.show()
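To make "transforming features to a set of principal components" concrete, the following is a minimal NumPy sketch of the underlying computation on the same three points. It is independent of ophelia_spark and is not the library's implementation; the function name pca_project is hypothetical, and the sketch assumes the data are mean-centered before projection, which individual libraries may or may not do.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal components (illustrative only)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)  # assumption: center the data before projecting
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)  # share of total variance carried by each component
    return centered @ vt[:k].T, explained

points = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.1]]
projected, explained = pca_project(points, k=2)
print(projected.shape)  # (3, 2)
print(explained)        # variance shares, largest first
```

With k equal to the number of input features, as in the example above, the projection only rotates the data; choosing a smaller k is what actually reduces dimensionality.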
SingularVD
SingularVD performs Singular Value Decomposition (SVD) on Spark DataFrames. It factors the feature matrix into three matrices (U, Σ, and V^T), capturing the essential structure of the data.
Parameters:
- k: Number of singular values and vectors.
- offset: Threshold for cumulative variance.
- label_col: Label column name.
Example:
from ophelia_spark.ml.unsupervised.feature import SingularVD
from ophelia_spark import DenseVector
df = spark.createDataFrame(
[(DenseVector([0.0, 1.0]),), (DenseVector([1.0, 0.0]),), (DenseVector([0.2, 0.1]),)],
["features"]
)
svd = SingularVD(k=2, offset=95, label_col='label')
result = svd.transform(df)
result.show()
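The decomposition and the role of a cumulative-variance threshold can be sketched with NumPy alone. This is not SingularVD's implementation: the helper components_for_threshold is hypothetical, and it assumes that offset is a percentage (so 95 means 95% of the variance), which the parameter description does not state explicitly.

```python
import numpy as np

def components_for_threshold(singular_values, offset):
    """Smallest number of components whose cumulative variance share reaches `offset` percent.

    Assumption: `offset` is a percentage in (0, 100].
    """
    variance = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(variance) / variance.sum() * 100.0
    # Index of the first cumulative share >= offset, converted to a count.
    count = int(np.searchsorted(cumulative, offset)) + 1
    return min(count, len(cumulative))

X = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.1]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U @ diag(s) @ Vt

reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(reconstructed, X))  # True
print(components_for_threshold(s, offset=95))
```

Multiplying the three factors back together recovers the original matrix, and truncating to the leading components gives a low-rank approximation.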
IndependentComponent
IndependentComponent performs Independent Component Analysis (ICA) on Spark DataFrames. It separates a multivariate signal into additive, statistically independent components.
Parameters:
- n_components: Number of independent components.
Example:
from ophelia_spark.ml.unsupervised.feature import IndependentComponent
from ophelia_spark import DenseVector
df = spark.createDataFrame(
[(DenseVector([0.0, 1.0]),), (DenseVector([1.0, 0.0]),), (DenseVector([0.2, 0.1]),)],
["features"]
)
ica = IndependentComponent(n_components=2)
result = ica.transform(df)
result.show()
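ICA algorithms commonly begin by whitening the observed signals, that is, decorrelating them and scaling each to unit variance, before searching for independent directions. The sketch below shows only that preprocessing step in NumPy, on two synthetic mixed signals. It is not a full ICA and not IndependentComponent's implementation; the signals, the mixing matrix, and the function name whiten are all made up for illustration.

```python
import numpy as np

def whiten(X):
    """Return a decorrelated, unit-variance version of X (rows are samples)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = centered.T @ centered / len(centered)
    eigvals, eigvecs = np.linalg.eigh(cov)  # covariance is symmetric
    # Rotate onto the eigenvectors, then rescale each axis to unit variance.
    return centered @ eigvecs / np.sqrt(eigvals)

# Two hypothetical source signals, linearly mixed into two observed channels.
t = np.linspace(0.0, 8.0, 2000)
sources = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])
mixing = np.array([[1.0, 0.5], [0.5, 1.0]])
observed = sources @ mixing.T

white = whiten(observed)
cov_white = white.T @ white / len(white)
print(np.round(cov_white, 6))  # approximately the identity matrix
```

After whitening, the remaining task for an ICA algorithm is to find the rotation that makes the components as statistically independent as possible.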
LinearDAnalysis
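LinearDAnalysis performs Linear Discriminant Analysis (LDA) on Spark DataFrames. It finds a linear combination of features that characterizes or separates two or more classes. LDA is a supervised technique, so the input DataFrame needs a label column alongside the features, as in the example below.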
Parameters:
- n_components: Number of components for dimensionality reduction.
Example:
from ophelia_spark.ml.unsupervised.feature import LinearDAnalysis
from ophelia_spark import DenseVector
df = spark.createDataFrame(
[(DenseVector([0.0, 1.0]), 1), (DenseVector([1.0, 0.0]), 0), (DenseVector([0.2, 0.1]), 1)],
["features", "label"]
)
lda = LinearDAnalysis(n_components=1)
result = lda.transform(df)
result.show()
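For two classes, the "linear combination of features that separates the classes" is Fisher's discriminant direction, which can be sketched in a few lines of NumPy. This is an illustration of the idea rather than LinearDAnalysis's implementation; the function name fisher_direction and the eight sample points are hypothetical. A larger dataset than the three rows above is used because the within-class scatter matrix must be invertible.

```python
import numpy as np

def fisher_direction(X, y):
    """Two-class Fisher discriminant: w = Sw^{-1} (mu1 - mu0)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: summed scatter of each class around its own mean.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    return np.linalg.solve(Sw, mu1 - mu0)

# Hypothetical, well-separated data: class 0 near the origin, class 1 near (3.5, 3.5).
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [3, 3], [4, 3], [3, 4], [4, 4]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

w = fisher_direction(X, y)
projected = X @ w  # one discriminant component per row
print(w)
print(projected)
```

With C classes, LDA yields at most C - 1 discriminant components, which is why a two-class problem such as the example above is reduced to a single component.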
LLinearEmbedding
LLinearEmbedding performs Locally Linear Embedding (LLE) on Spark DataFrames. It reduces the dimensionality of the data while preserving the relationships between neighboring data points.
Parameters:
- n_neighbors: Number of neighbors to consider for each point.
- n_components: Number of dimensions for the embedded space.
Example:
from ophelia_spark.ml.unsupervised.feature import LLinearEmbedding
from ophelia_spark import DenseVector
df = spark.createDataFrame(
[(DenseVector([0.0, 1.0]),), (DenseVector([1.0, 0.0]),), (DenseVector([0.2, 0.1]),)],
["features"]
)
lle = LLinearEmbedding(n_neighbors=5, n_components=2)
result = lle.transform(df)
result.show()
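LLE starts by finding each point's nearest neighbors, which is what n_neighbors controls. The NumPy sketch below shows only that neighbor-selection step, not the full embedding and not LLinearEmbedding's implementation; the function name nearest_neighbors is hypothetical. It also makes one constraint visible: a dataset of n rows offers each point at most n - 1 neighbors, so many implementations require n_neighbors to be smaller than the number of rows. Whether ophelia_spark enforces this is not stated here.

```python
import numpy as np

def nearest_neighbors(X, n_neighbors):
    """Indices of the n_neighbors closest other points for every row of X."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if n_neighbors > n - 1:
        raise ValueError(
            f"n_neighbors={n_neighbors} exceeds the {n - 1} other points available"
        )
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :n_neighbors]

points = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.1]]
neighbors = nearest_neighbors(points, n_neighbors=1)
print(neighbors.ravel().tolist())  # [2, 2, 1]
```

On a real dataset, n_neighbors is typically much smaller than the row count, so that each neighborhood stays local.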
StochasticNeighbor
StochasticNeighbor performs t-distributed Stochastic Neighbor Embedding (t-SNE) on Spark DataFrames. t-SNE is a nonlinear dimensionality reduction technique well suited to embedding high-dimensional data in a low-dimensional space for visualization.
Parameters:
- n_components: Number of dimensions for the embedded space.
- perplexity: Perplexity parameter for t-SNE.
- learning_rate: Learning rate parameter for t-SNE.
- n_iter: Number of iterations for optimization.
Example:
from ophelia_spark.ml.unsupervised.feature import StochasticNeighbor
from ophelia_spark import DenseVector
df = spark.createDataFrame(
[(DenseVector([0.0, 1.0]),), (DenseVector([1.0, 0.0]),), (DenseVector([0.2, 0.1]),)],
["features"]
)
tsne = StochasticNeighbor(n_components=2, perplexity=30.0, learning_rate=200.0, n_iter=1000)
result = tsne.transform(df)
result.show()
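The perplexity parameter is easier to choose once its definition is clear. For a probability distribution P over a point's neighbors, perplexity is 2 raised to the Shannon entropy of P (in bits), and it can be read as the effective number of neighbors each point attends to. The pure-Python sketch below computes that quantity for a given distribution; it is not part of ophelia_spark, and the function name perplexity is chosen here for illustration.

```python
import math

def perplexity(probabilities):
    """Perplexity 2**H(P), where H(P) is the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy

# Uniform weight over four neighbors: effectively four neighbors.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0

# All weight on a single neighbor: effectively one neighbor.
print(perplexity([1.0, 0.0, 0.0, 0.0]))
```

Read this way, a perplexity of 30 presupposes a dataset with well over 30 rows; the three-row DataFrame above illustrates the call signature rather than a realistic workload.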