Ophelian Generic Examples

📚 Reading Data

The Read class implements Spark reading objects in multiple formats: {'csv', 'parquet', 'excel', 'json'}

from ophelian.read.spark_read import Read
spark_df = spark.readFile(path, 'csv', header=True, infer_schema=True)

Viewing DataFrame Shape

Use the Shape class from the functions module to view the dimensions of your Spark DataFrame, similar to NumPy:

from ophelian.functions import Shape

dic = {
    'Product': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Year': [2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2012],
    'Revenue': [100, 200, 300, 110, 190, 320, 120, 220, 350]
}
dic_to_df = spark.createDataFrame(pd.DataFrame(data=dic))
dic_to_df.show(10, False)

+---------+------------+-----------+
| Product |    Year    |  Revenue  |
+---------+------------+-----------+
|    A    |    2010    |    100    |
|    B    |    2010    |    200    |
|    C    |    2010    |    300    |
|    A    |    2011    |    110    |
|    B    |    2011    |    190    |
|    C    |    2011    |    320    |
|    A    |    2012    |    120    |
|    B    |    2012    |    220    |
|    C    |    2012    |    350    |
+---------+------------+-----------+

dic_to_df.Shape
(9, 3)

The pct_change wrapper is added to the Spark DataFrame class in order to have the most commonly used method in Pandas
objects to get the relative percentage change from one observation to another, sorted by a date-type column and lagged by a numeric-type column.

from ophelian.functions import PctChange

dic_to_df.pctChange().show(10, False)

+---------------------+
| Revenue             |
+---------------------+
| null                |
| 1.0                 |
| 0.5                 |
| -0.6333333333333333 |
| 0.7272727272727273  |
| 0.6842105263157894  |
| -0.625              |
| 0.8333333333333333  |
| 0.5909090909090908  |
+---------------------+

Another option is to configure all receiving parameters from the function, as follows:

periods; this parameter will control the offset of the lag periods. Since the default value is 1, this will always return a lag-1 information DataFrame.
partition_by; this parameter will fix the partition column over the DataFrame, e.g. 'bank_segment', 'assurance_product_type'.
order_by; order by parameter will be the specific column to order the sequential observations, e.g. 'balance_date', 'trade_close_date', 'contract_date'.
pct_cols; percentage change col (pct_cols) will be the specific column to lag-over giving back the relative change between one element to other, e.g. 𝑥𝑡 ÷ 𝑥𝑡 − 1

In this case, we will specify only the periods parameter to yield a lag of -2 days over the DataFrame.

dic_to_df.pctChange(periods=2).na.fill(0).show(5, False)

+--------------------+
|Revenue             |
+--------------------+
|0.0                 |
|0.0                 |
|2.0                 |
|-0.44999999999999996|
|-0.3666666666666667 |
+--------------------+
only showing top 5 rows

Adding parameters: partition_by, order_by & pct_cols

dic_to_df.pctChange(partition_by="Product", order_by="Year", pct_cols="Revenue").na.fill(0).show(5, False)

+---------------------+
|Revenue              |
+---------------------+
|0.0                  |
|-0.050000000000000044|
|0.1578947368421053   |
|0.0                  |
|0.06666666666666665  |
+---------------------+
only showing top 5 rows

You may also lag more than one column at a time by simply adding a list with string column names:

dic_to_df.pctChange(partition_by="Product", order_by="Year", pct_cols=["Year", "Revenue"]).na.fill(0).show(5, False)

+--------------------+---------------------+
|Year                |Revenue              |
+--------------------+---------------------+
|0.0                 |0.0                  |
|4.975124378110429E-4|-0.050000000000000044|
|4.972650422674363E-4|0.1578947368421053   |
|0.0                 |0.0                  |
|4.975124378110429E-4|0.06666666666666665  |
+--------------------+---------------------+
only showing top 5 rows