Ophelian Spark ML Feature Miner

📊 Feature Mining with Ophelian

BuildStringIndex

BuildStringIndex is a class that indexes the string columns of a Spark DataFrame into numeric codes, mapping each unique string to a unique code number. By default, the most frequent label gets index 0, the next most frequent gets index 1, and so on.

Note: Specifying path also requires setting dir_name. Together, these parameters persist a versioned copy of the estimator metadata to disk (e.g., HDFS), which helps reduce memory usage during training and prediction.

Parameters:

  • input_cols: String or list of string column names to index (categorical data type).
  • path: Disk path to persist metadata model estimator. Optional.
  • dir_name: Directory name to persist metadata model estimator inside path. Optional.

Example:

from ophelia_spark.ml.feature_miner import BuildStringIndex
df = spark.createDataFrame(
    [('apple', 'red'), ('banana', 'yellow'), ('coconut', 'brown')],
    ['fruit_type', 'fruit_color']
)
string_cols_list = ['fruit_type', 'fruit_color']
indexer = BuildStringIndex(string_cols_list, '/path/estimator/save/metadata/', 'StringIndex')
indexer.transform(df).show(5, False)
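
For reference, with three labels that each occur once, frequency ties are broken alphabetically by Spark's underlying StringIndexer. Assuming the indexer appends an _index suffix to each input column (as the BuildOneHotEncoder example below suggests), the transformed DataFrame would look roughly like this:

+----------+-----------+----------------+-----------------+
|fruit_type|fruit_color|fruit_type_index|fruit_color_index|
+----------+-----------+----------------+-----------------+
|apple     |red        |0.0             |1.0              |
|banana    |yellow     |1.0             |2.0              |
|coconut   |brown      |2.0             |0.0              |
+----------+-----------+----------------+-----------------+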

BuildOneHotEncoder

The BuildOneHotEncoder class builds a one-hot encoder estimator for a Spark DataFrame. It maps a previously indexed category column to a binary vector, creating a unique binary vector for each string index.

Note: By default (handle_invalid = 'error'), invalid categories (e.g., typos) raise an error at transform time. Set handle_invalid = 'keep' to encode them instead; with drop_last enabled, invalid values map to the all-zero vector.

Parameters:

  • input_cols: String or list of index column names to encode.
  • path: Disk path to persist metadata model estimator. Optional.
  • dir_name: Directory name to persist metadata model estimator inside path. Optional.
  • drop_last: If True, produces a dummy encoding by dropping the last binary category, which is then represented by the all-zero vector.
  • handle_invalid: How to handle invalid categories (e.g., typos) in the input columns. Set to 'error' by default, which raises an error on invalid values; 'keep' encodes them instead.

Example:

from ophelia_spark.ml.feature_miner import BuildOneHotEncoder
df = spark.createDataFrame(
    [(0.0, 2.0), (1.0, 0.0), (2.0, 1.0)],
    ["fruit_type_index", "fruit_color_index"]
)
indexed_cols_list = ['fruit_type_index', 'fruit_color_index']
encoder = BuildOneHotEncoder(indexed_cols_list, '/path/estimator/save/metadata/', 'OneHotEncoder')
encoder.transform(df).show(5, False)
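
To see what the encoding itself looks like, here is a minimal sketch using Spark's own OneHotEncoder (the estimator this class wraps, per its description). With drop_last enabled, three categories are encoded into two-element vectors and the last category becomes the all-zero vector:

from pyspark.ml.feature import OneHotEncoder

df = spark.createDataFrame([(0.0,), (1.0,), (2.0,)], ['category_index'])
encoder = OneHotEncoder(inputCols=['category_index'], outputCols=['category_vec'], dropLast=True)
encoder.fit(df).transform(df).show(3, False)
# 0.0 -> (2,[0],[1.0]), 1.0 -> (2,[1],[1.0]), 2.0 -> (2,[],[])  (the all-zero vector)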

BuildVectorAssembler

BuildVectorAssembler assembles the given columns into a single vector column, supporting both sparse and dense vectorization. Only numeric features are accepted.

Parameters:

  • input_cols: String or list of input column names (numeric features).
  • name_vec: Name for the vector transformation column. Default is 'features'.

Example:

from ophelia_spark.ml.feature_miner import BuildVectorAssembler
df = spark.createDataFrame(
    [(0.0, 1.0), (1.0, 0.0), (0.2, 0.1)],
    ["feature1", "feature2"]
)
input_cols = ['feature1', 'feature2']
assembler = BuildVectorAssembler(input_cols)
assembler.transform(df).show(5, False)
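
Given the input above and the default name_vec of 'features', the assembled output would look like this (illustrative):

+--------+--------+---------+
|feature1|feature2|features |
+--------+--------+---------+
|0.0     |1.0     |[0.0,1.0]|
|1.0     |0.0     |[1.0,0.0]|
|0.2     |0.1     |[0.2,0.1]|
+--------+--------+---------+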

BuildStandardScaler

BuildStandardScaler standardizes features to unit variance and, when with_mean is True, also centers them to zero mean. Note that centering sparse input produces dense output.

Parameters:

  • with_mean: Centers the data with mean before scaling. Default is False.
  • with_std: Scales the data to unit standard deviation. Default is True.
  • path: Disk path to persist metadata model estimator. Optional.
  • input_col: Name of the input column to scale. Default is 'features'.
  • output_col: Name of the output column to create with scaled features. Default is 'scaled_features'.

Example:

from ophelia_spark.ml.feature_miner import BuildStandardScaler
from pyspark.ml.linalg import DenseVector
df = spark.createDataFrame(
    [(DenseVector([0.0, 1.0]),), (DenseVector([1.0, 0.0]),), (DenseVector([0.2, 0.1]),)],
    ["features"]
)
scaler = BuildStandardScaler()
scaler.transform(df).show(5, False)
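
Under the hood this is the standard z-scaling formula. A minimal NumPy sketch of the same computation, assuming Spark's corrected (ddof=1) sample standard deviation:

import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.1]])
std = X.std(axis=0, ddof=1)              # per-column sample standard deviation
scaled = X / std                         # with_mean=False (default): scale only
centered = (X - X.mean(axis=0)) / std    # with_mean=True: center, then scale
print(scaled)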

SparkToNumpy

SparkToNumpy converts a Spark DataFrame to a NumPy array.

Parameters:

  • list_columns: List of columns to convert. Optional.

Example:

from ophelia_spark.ml.feature_miner import SparkToNumpy
df = spark.createDataFrame(
    [(0.0, 1.0), (1.0, 0.0), (0.2, 0.1)],
    ["feature1", "feature2"]
)
converter = SparkToNumpy(['feature1', 'feature2'])
numpy_array = converter.transform(df)
print(numpy_array)
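
If you prefer not to use the helper, an equivalent conversion in plain PySpark is to collect the selected columns to the driver (a sketch; note this pulls every row into driver memory, so it only suits small DataFrames):

import numpy as np

numpy_array = np.array(df.select('feature1', 'feature2').collect())
print(numpy_array)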

NumpyToVector

NumpyToVector converts a NumPy array to a Spark DataFrame with vectorized features.

Parameters:

  • np_object: NumPy array to convert.
  • label_t: Label type column. Default is 1.

Example:

from ophelia_spark.ml.feature_miner import NumpyToVector
import numpy as np

np_array = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.1]])
converter = NumpyToVector(spark.sparkContext)
df = converter.transform(np_array)
df.show()
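
An equivalent conversion in plain PySpark, assuming the helper produces a single vector column named 'features' (a sketch, not necessarily the library's exact output schema):

from pyspark.ml.linalg import Vectors

rows = [(Vectors.dense(row),) for row in np_array.tolist()]
df_manual = spark.createDataFrame(rows, ['features'])
df_manual.show(truncate=False)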

🛠️ Utility Functions

Gini Score

Computes the Gini index for a given node.

Parameters:

  • node: Dictionary mapping each class label to its sample count at the node.

Example:

node = {'class1': 10, 'class2': 30}
score = converter.gini_score(node)
print(score)
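
For this node the class proportions are 10/40 = 0.25 and 30/40 = 0.75, so the standard Gini index is 1 - (0.25² + 0.75²) = 0.375. A minimal sketch of that formula, assuming gini_score implements the standard definition:

def gini(node):
    total = sum(node.values())
    return 1.0 - sum((count / total) ** 2 for count in node.values())

print(gini({'class1': 10, 'class2': 30}))  # 0.375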

Entropy Score

Computes the entropy for a given node.

Parameters:

  • node: Dictionary mapping each class label to its sample count at the node.

Example:

node = {'class1': 10, 'class2': 30}
score = converter.entropy_score(node)
print(score)
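
With the same proportions, the standard Shannon entropy is -(0.25·log2(0.25) + 0.75·log2(0.75)) ≈ 0.811 bits. A minimal sketch, assuming entropy_score implements this definition:

import math

def entropy(node):
    total = sum(node.values())
    return -sum((c / total) * math.log2(c / total) for c in node.values() if c)

print(entropy({'class1': 10, 'class2': 30}))  # ≈ 0.811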

Information Gain

Computes the information gain for a parent node and its children.

Parameters:

  • parent: Dictionary mapping each class label to its sample count at the parent node.
  • children: List of dictionaries with the same structure, one per child node.
  • criterion: Criterion for computing the gain ('gini' or 'entropy').

Example:

parent = {'class1': 10, 'class2': 30}
children = [{'class1': 5, 'class2': 10}, {'class1': 5, 'class2': 20}]
gain = converter.information_gain(parent, children, 'gini')
print(gain)
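
Information gain is the parent's impurity minus the size-weighted impurity of its children. For the example above with the Gini criterion: the parent scores 0.375, the children score 4/9 ≈ 0.444 (15 samples) and 0.32 (25 samples), so the gain is 0.375 - (15/40)·0.444 - (25/40)·0.32 ≈ 0.008. A minimal sketch, assuming information_gain follows this standard definition:

def gini(node):
    total = sum(node.values())
    return 1.0 - sum((count / total) ** 2 for count in node.values())

def information_gain(parent, children, impurity=gini):
    n = sum(parent.values())
    weighted = sum(sum(child.values()) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

parent = {'class1': 10, 'class2': 30}
children = [{'class1': 5, 'class2': 10}, {'class1': 5, 'class2': 20}]
print(information_gain(parent, children))  # ≈ 0.0083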