Ophelian Spark ML Feature Miner
📊 Feature Mining with Ophelian
BuildStringIndex
BuildStringIndex is a class that indexes the string columns of a Spark DataFrame into numeric codes, mapping each unique string to a unique code number. By default, the most frequent label gets index 0, the next most frequent gets index 1, and so on.
Note: Specifying estimator_path requires setting a directory name. This parameter creates a metadata model version on disk (e.g., HDFS) and helps reduce memory usage during training and prediction.
Parameters:
- input_cols: String or list of string column names to index (categorical data type).
- path: Disk path to persist metadata model estimator. Optional.
- dir_name: Directory name to persist the metadata model estimator inside path. Optional.
Example:
from ophelia_spark.ml.feature_miner import BuildStringIndex
df = spark.createDataFrame(
[('apple', 'red'), ('banana', 'yellow'), ('coconut', 'brown')],
['fruit_type', 'fruit_color']
)
string_cols_list = ['fruit_type', 'fruit_color']
indexer = BuildStringIndex(string_cols_list, '/path/estimator/save/metadata/', 'StringIndex')
indexer.transform(df).show(5, False)
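The frequency-based ordering is easiest to see with repeated labels. Below is a minimal sketch using Spark's built-in StringIndexer (standard Spark semantics are assumed here; BuildStringIndex presumably wraps this behavior) on a column where 'apple' is the most common value:
from pyspark.ml.feature import StringIndexer
# 'apple' appears 3 times, 'banana' twice, 'coconut' once
df_freq = spark.createDataFrame(
    [('apple',), ('apple',), ('apple',), ('banana',), ('banana',), ('coconut',)],
    ['fruit_type']
)
model = StringIndexer(inputCol='fruit_type', outputCol='fruit_type_index').fit(df_freq)
model.transform(df_freq).show()
# Most frequent label gets the lowest index:
# 'apple' -> 0.0, 'banana' -> 1.0, 'coconut' -> 2.0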
BuildOneHotEncoder
The BuildOneHotEncoder class builds a One-Hot Encoder Estimator for a Spark DataFrame. It maps a previously indexed category column to a binary vector, creating a unique binary vector for each string index.
Note: It can handle invalid categories (e.g., typos) by discarding them. Set handle_invalid = 'keep' to encode invalid values as an all-zero vector.
Parameters:
- input_cols: String or list of index column names to encode.
- path: Disk path to persist metadata model estimator. Optional.
- dir_name: Directory name to persist the metadata model estimator inside path. Optional.
- drop_last: If True, creates a dummy encoding by removing the last binary category, so that category maps to an all-zero vector.
- handle_invalid: Controls how typos or invalid values in categorical columns are handled. Defaults to 'error'.
Example:
from ophelia_spark.ml.feature_miner import BuildOneHotEncoder
df = spark.createDataFrame(
    [(0.0, 2.0), (1.0, 0.0), (2.0, 1.0)],
    ['fruit_type_index', 'fruit_color_index']
)
indexed_cols_list = ['fruit_type_index', 'fruit_color_index']
encoder = BuildOneHotEncoder(indexed_cols_list, '/path/estimator/save/metadata/', 'OneHotEncoder')
encoder.transform(df).show(5, False)
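To see the resulting binary vectors, here is a minimal sketch with Spark's built-in OneHotEncoder (Spark 3.x API assumed), illustrating the drop_last behavior described above:
from pyspark.ml.feature import OneHotEncoder
df_idx = spark.createDataFrame([(0.0,), (1.0,), (2.0,)], ['fruit_type_index'])
ohe = OneHotEncoder(inputCols=['fruit_type_index'], outputCols=['fruit_type_vec'])
ohe.fit(df_idx).transform(df_idx).show(truncate=False)
# With dropLast=True (the default), the last category maps to the all-zero vector:
# 0.0 -> (2,[0],[1.0]), 1.0 -> (2,[1],[1.0]), 2.0 -> (2,[],[])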
BuildVectorAssembler
BuildVectorAssembler assembles a vector from given columns, suitable for sparse and dense vectorization. Only numeric features are accepted.
Parameters:
- input_cols: String or list of input column names (numeric features).
- name_vec: Name for the vector transformation column. Default is 'features'.
Example:
from ophelia_spark.ml.feature_miner import BuildVectorAssembler
df = spark.createDataFrame(
[(0.0, 1.0), (1.0, 0.0), (0.2, 0.1)],
["feature1", "feature2"]
)
input_cols = ['feature1', 'feature2']
assembler = BuildVectorAssembler(input_cols)
assembler.transform(df).show(5, False)
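For reference, the equivalent operation with Spark's own VectorAssembler looks like this (a sketch; it is assumed that BuildVectorAssembler delegates to this transformer):
from pyspark.ml.feature import VectorAssembler
va = VectorAssembler(inputCols=['feature1', 'feature2'], outputCol='features')
va.transform(df).show(truncate=False)
# Each row gains a 'features' vector column, e.g. [0.0,1.0]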
BuildStandardScaler
BuildStandardScaler scales features to unit variance and, optionally, zero mean, producing a dense output.
Parameters:
- with_mean: Centers the data with the mean before scaling. Default is False.
- with_std: Scales the data to unit standard deviation. Default is True.
- path: Disk path to persist metadata model estimator. Optional.
- input_col: Name of the input column to scale. Default is 'features'.
- output_col: Name of the output column to create with scaled features. Default is 'scaled_features'.
Example:
from ophelia_spark.ml.feature_miner import BuildStandardScaler
from pyspark.ml.linalg import DenseVector
df = spark.createDataFrame(
[(DenseVector([0.0, 1.0]),), (DenseVector([1.0, 0.0]),), (DenseVector([0.2, 0.1]),)],
["features"]
)
scaler = BuildStandardScaler()
scaler.transform(df).show(5, False)
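Under standard Spark semantics, the same scaling can be sketched with pyspark.ml.feature.StandardScaler; the defaults above correspond to withMean=False, withStd=True:
from pyspark.ml.feature import StandardScaler
# fit() computes the column statistics; transform() applies the scaling
scaler_ref = StandardScaler(inputCol='features', outputCol='scaled_features',
                            withMean=False, withStd=True)
scaler_ref.fit(df).transform(df).show(truncate=False)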
SparkToNumpy
SparkToNumpy converts a Spark DataFrame to a NumPy array.
Parameters:
- list_columns: List of columns to convert. Optional.
Example:
from ophelia_spark.ml.feature_miner import SparkToNumpy
df = spark.createDataFrame(
[(0.0, 1.0), (1.0, 0.0), (0.2, 0.1)],
["feature1", "feature2"]
)
converter = SparkToNumpy(['feature1', 'feature2'])
numpy_array = converter.transform(df)
print(numpy_array)
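For comparison, a hand-rolled conversion with plain PySpark and NumPy (no library helpers) collects the rows and builds the array directly:
import numpy as np
# Row objects behave like tuples, so np.array() yields a 2-D array
np_equiv = np.array(df.select('feature1', 'feature2').collect())
print(np_equiv)  # [[0.  1. ] [1.  0. ] [0.2 0.1]]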
NumpyToVector
NumpyToVector converts a NumPy array to a Spark DataFrame with vectorized features.
Parameters:
- np_object: NumPy array to convert.
- label_t: Label type column. Default is 1.
Example:
from ophelia_spark.ml.feature_miner import NumpyToVector
import numpy as np
np_array = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.1]])
converter = NumpyToVector(spark.sparkContext)
df = converter.transform(np_array)
df.show()
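The reverse direction can also be sketched by hand, wrapping each NumPy row in a Spark ML dense vector (an illustration only, not the library's implementation):
from pyspark.ml.linalg import Vectors
rows = [(Vectors.dense(row),) for row in np_array]
spark.createDataFrame(rows, ['features']).show(truncate=False)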
🛠️ Utility Functions
Gini Score
Computes the Gini index for a given node.
Parameters:
- node: Dictionary mapping each class label to its sample count at the node.
Example:
node = {'class1': 10, 'class2': 30}
score = converter.gini_score(node)
print(score)
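The Gini index of a node is 1 minus the sum of squared class probabilities. A pure-Python sketch (the gini helper below is illustrative, not part of the library's API) reproduces the computation for the node above:
def gini(node):
    # node maps class label -> sample count
    total = sum(node.values())
    return 1.0 - sum((count / total) ** 2 for count in node.values())
print(gini({'class1': 10, 'class2': 30}))  # 1 - (0.25**2 + 0.75**2) = 0.375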
Entropy Score
Computes the entropy for a given node.
Parameters:
- node: Dictionary mapping each class label to its sample count at the node.
Example:
node = {'class1': 10, 'class2': 30}
score = converter.entropy_score(node)
print(score)
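Entropy is the negative sum of p * log2(p) over the class probabilities. A matching pure-Python sketch (again illustrative, not the library's API):
import math
def entropy(node):
    # node maps class label -> sample count; skip empty classes
    total = sum(node.values())
    return -sum((c / total) * math.log2(c / total) for c in node.values() if c > 0)
print(entropy({'class1': 10, 'class2': 30}))  # -(0.25*log2(0.25) + 0.75*log2(0.75)) ~ 0.8113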
Information Gain
Computes the information gain for a parent node and its children.
Parameters:
- parent: Dictionary of class counts for the parent node.
- children: List of class-count dictionaries for the child nodes.
- criterion: Criterion for computing the gain ('gini' or 'entropy').
Example:
parent = {'class1': 10, 'class2': 30}
children = [{'class1': 5, 'class2': 10}, {'class1': 5, 'class2': 20}]
gain = converter.information_gain(parent, children, 'gini')
print(gain)
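Information gain is the parent's impurity minus the size-weighted impurity of its children. A pure-Python sketch (illustrative helpers, not the library's API) for the numbers above, using the Gini criterion:
def gini(node):
    total = sum(node.values())
    return 1.0 - sum((c / total) ** 2 for c in node.values())
def information_gain(parent, children):
    n = sum(parent.values())
    # Weight each child's impurity by its share of the parent's samples
    weighted = sum(sum(c.values()) / n * gini(c) for c in children)
    return gini(parent) - weighted
parent = {'class1': 10, 'class2': 30}
children = [{'class1': 5, 'class2': 10}, {'class1': 5, 'class2': 20}]
print(information_gain(parent, children))  # 0.375 - 0.3667 ~ 0.0083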