In statistics, a power transform is a family of functions applied to create a monotonic transformation of data using power functions. It is a useful data transformation technique used to stabilize variance, make the data more closely resemble a normal distribution, and improve the validity of measures of association such as the Pearson correlation between variables, among other data stabilization purposes.

Let’s first have a look at the Box-Cox transformation.

Box-Cox transformation

Assume we have a collection of bivariate data \(\mathcal{D} = \{(x_i, y_i)\}_{i = 1}^n\) and we want to explore the relationship between \(x\) and \(y\). If, by selecting \(\lambda\) properly, we obtain a simple linear relationship of the form

\[
y \approx \alpha + \beta x^\lambda
\]

or

\[
y \approx \alpha + \beta \log x,
\]

then we may consider changing the measurement scale for the rest of the statistical analysis. The following is Tukey’s ladder of transformations:

\[
\tilde{x}_\lambda =
\begin{cases}
x^\lambda & \lambda > 0, \\
\log x & \lambda = 0, \\
-(x^\lambda) & \lambda < 0.
\end{cases}
\]

The goal is to find a value of \(\lambda\) that makes the scatter diagram of the transformed \(\mathcal{D}\) as linear as possible. One approach might be to fit a straight line to the transformed points and try to minimize the residuals. However, an easier approach is based on the fact that the correlation coefficient, \(r\), is a measure of the linearity of a scatter diagram. In particular, if the points fall on a straight line then their correlation will be \(r = 1\). (We need not worry about the case \(r = -1\), since we have defined the Tukey transformed variable \(x^\lambda\) to be positively correlated with \(x\) itself.)
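
As a concrete illustration, here is a minimal sketch of this grid search, assuming NumPy and SciPy are available; the data are synthetic and the grid of \(\lambda\) values is arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic bivariate data with a roughly square-root relationship,
# so the best lambda should come out near 0.5.
x = rng.uniform(1, 100, size=200)
y = 3 * np.sqrt(x) + rng.normal(scale=0.5, size=200)

def tukey(x, lam):
    """Tukey's ladder; the sign flip for lambda < 0 keeps the
    transformed variable positively correlated with x."""
    if lam > 0:
        return x ** lam
    if lam == 0:
        return np.log(x)
    return -(x ** lam)

# Grid search: pick the lambda whose transformed x is most linearly related to y.
lambdas = np.round(np.arange(-2, 2.05, 0.05), 2)
r_values = [stats.pearsonr(tukey(x, lam), y)[0] for lam in lambdas]
best = lambdas[np.argmax(r_values)]
print(f"best lambda = {best:.2f}, r = {max(r_values):.4f}")
```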

Later, the Box-Cox transformation was introduced as

\[
x_\lambda =
\begin{cases}
\dfrac{x^\lambda - 1}{\lambda} & \lambda \neq 0, \\
\log x & \lambda = 0.
\end{cases}
\]

When \(\lambda \neq 0\), this transformation is monotone increasing, so it preserves the ordering of the data; the \(\lambda = 0\) case is the natural continuation, since by L’Hôpital’s rule \(\lim_{\lambda \rightarrow 0} x_\lambda = \log x\).
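
Spelling out the L’Hôpital step (differentiating numerator and denominator with respect to \(\lambda\), treating \(x > 0\) as fixed):

\[
\lim_{\lambda \to 0}\frac{x^\lambda - 1}{\lambda}
= \lim_{\lambda \to 0}\frac{\frac{d}{d\lambda}\left(x^\lambda - 1\right)}{\frac{d}{d\lambda}\,\lambda}
= \lim_{\lambda \to 0}\frac{x^\lambda \log x}{1}
= \log x.
\]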

We apply the same rule as in Tukey’s transformation to find the best \(\lambda\).
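
A minimal sketch of that idea, again on synthetic data, also comparing with `scipy.stats.boxcox`, which instead selects \(\lambda\) by maximizing a normal log-likelihood for \(x\) alone rather than the linearity of the \(x\)–\(y\) scatter:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=200)
y = 3 * np.sqrt(x) + rng.normal(scale=0.5, size=200)

def boxcox(x, lam):
    """Box-Cox transform of x for a given lambda."""
    return np.log(x) if lam == 0 else (x ** lam - 1) / lam

# Same rule as for Tukey's ladder: maximize the Pearson correlation with y.
lambdas = np.round(np.arange(-2, 2.05, 0.05), 2)
r_values = [stats.pearsonr(boxcox(x, lam), y)[0] for lam in lambdas]
print("correlation-based lambda:", lambdas[np.argmax(r_values)])

# For comparison, scipy.stats.boxcox picks lambda by maximum likelihood,
# i.e. it makes the transformed x itself as normal as possible -- a
# different criterion from the linearity of the x-y scatter.
x_bc, lam_mle = stats.boxcox(x)
print("maximum-likelihood lambda:", round(lam_mle, 3))
```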

In regression analysis, for the model

\[
y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i
\]

and the fitted model

\[
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_p x_{ip},
\]

each of the predictor variables \(x_j\) can be transformed. The usual criterion is the variance of the residuals, given by

\[
\hat{\sigma}^2 = \frac{1}{n}\sum_{i = 1}^n (y_i - \hat{y}_i)^2.
\]

Occasionally, the response variable \(y\) itself may be transformed. In this case, the variance of the residuals is not comparable as \(\lambda\) varies, because the transformed responses live on different scales; to make the comparison fair, the transformed response is defined as

\[
y_\lambda = \frac{y^\lambda - 1}{\lambda\, \bar{g}_y^{\lambda - 1}},
\]

where \(\bar{g}_y = (\prod_{i = 1}^ny_i)^{\frac{1}{n}}\) is the geometric mean of the response variable.

When \(\lambda = 0\), \(y_0 = \bar{g}_y\log y\).
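
To see why the rescaling matters, here is a small sketch (synthetic data, NumPy only): the residual sum of squares of the raw transform \(\frac{y^\lambda - 1}{\lambda}\) shrinks as \(\lambda\) decreases simply because the transformed values shrink, whereas the \(\bar{g}_y\)-scaled version can be compared across \(\lambda\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data in which sqrt(y) is linear in x, so the "right" lambda is 0.5.
x = rng.uniform(1, 10, size=300)
y = (2.0 + 1.5 * x + rng.normal(scale=0.3, size=300)) ** 2

g = np.exp(np.mean(np.log(y)))  # geometric mean of the response

def transform(y, lam, scaled):
    """Box-Cox transform of the response, with or without the geometric-mean scaling."""
    if lam == 0:
        return g * np.log(y) if scaled else np.log(y)
    return (y ** lam - 1) / ((lam * g ** (lam - 1)) if scaled else lam)

X = np.column_stack([np.ones_like(x), x])  # design matrix for a straight-line fit

def rss(target):
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return np.sum((target - X @ beta) ** 2)

for lam in [1.0, 0.5, 0.0]:
    rss_raw = rss(transform(y, lam, scaled=False))
    rss_scaled = rss(transform(y, lam, scaled=True))
    print(f"lambda={lam:3.1f}  raw RSS={rss_raw:12.2f}  scaled RSS={rss_scaled:12.2f}")
# The raw RSS shrinks as lambda decreases simply because the transformed values
# shrink; the scaled RSS is comparable across lambda and is smallest near 0.5.
```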

Power transformation

Here is the definition of the power transformation: for data vectors \((y_1, y_2, \dots, y_n)\) in which each \(y_i > 0\), the power transformation is

\[
y_i^{(\lambda)} =
\begin{cases}
\dfrac{y_i^\lambda - 1}{\lambda\, GM(y)^{\lambda - 1}} & \lambda \neq 0, \\
GM(y)\log y_i & \lambda = 0,
\end{cases}
\]

where \(GM(y) = (y_1 \cdot y_2 \cdot \dots \cdot y_n)^{\frac{1}{n}}\) is the geometric mean of the observations.
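
As a quick numerical sanity check that the two branches agree in the limit (a sketch assuming NumPy; the data vector is just some arbitrary positive numbers):

```python
import numpy as np

y = np.array([0.5, 1.0, 2.0, 5.0, 10.0])
gm = np.exp(np.mean(np.log(y)))  # GM(y)

def power_transform(y, lam):
    if lam == 0:
        return gm * np.log(y)
    return (y ** lam - 1) / (lam * gm ** (lam - 1))

# The lambda != 0 branch converges to GM(y) * log(y) as lambda -> 0.
print(power_transform(y, 1e-6))
print(power_transform(y, 0.0))
```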

The following is from the Wikipedia article; I have not fully understood it, especially the definition of the Jacobian.

With the Jacobian of the rescaled transformation \(\frac{y^\lambda - 1}{\lambda}\),

\[
J(\lambda; y_1, \dots, y_n) = \prod_{i = 1}^n \left|\frac{d y_i^{(\lambda)}}{d y_i}\right| = \prod_{i = 1}^n y_i^{\lambda - 1} = GM(y)^{n(\lambda - 1)},
\]

then the normal log-likelihood at its maximum can be written as

\[
\log \mathcal{L}(\hat{\mu}, \hat{\sigma})
= -\frac{n}{2}\left(\log\big(2\pi\hat{\sigma}^2\big) + 1\right) + n(\lambda - 1)\log GM(y)
= -\frac{n}{2}\left(\log\frac{2\pi\hat{\sigma}^2}{GM(y)^{2(\lambda - 1)}} + 1\right).
\]

Absorbing \(GM(y)^{2(\lambda - 1)}\) into the expression for \(\hat{\sigma}^2\) produces an expression establishing that minimizing the sum of squared residuals from \(y_i^{(\lambda)}\) is equivalent to maximizing the sum of the normal log-likelihood of deviations from \(\frac{y^\lambda - 1}{\lambda}\) and the log of the Jacobian of the transformation.
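
Here is a small numerical check of that equivalence (a sketch assuming SciPy and synthetic data; `scipy.stats.boxcox_llf` computes the Box-Cox log-likelihood, which as I understand it already includes the log-Jacobian term): the \(\lambda\) minimizing the sum of squared deviations of the \(GM(y)\)-scaled transform is the same \(\lambda\) that maximizes the log-likelihood.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.lognormal(mean=1.0, sigma=0.4, size=300)  # positive, right-skewed data

gm = np.exp(np.mean(np.log(y)))  # GM(y)

def scaled_transform(y, lam):
    """Power transform with the geometric-mean factor in the denominator."""
    if lam == 0:
        return gm * np.log(y)
    return (y ** lam - 1) / (lam * gm ** (lam - 1))

lambdas = np.round(np.arange(-1.0, 2.05, 0.05), 2)
ss = []   # sum of squared deviations of the scaled transform (to be minimized)
llf = []  # Box-Cox log-likelihood, including the log-Jacobian term
for lam in lambdas:
    z = scaled_transform(y, lam)
    ss.append(np.sum((z - z.mean()) ** 2))
    llf.append(stats.boxcox_llf(lam, y))

print("lambda minimizing the scaled sum of squares:", lambdas[np.argmin(ss)])
print("lambda maximizing the Box-Cox log-likelihood:", lambdas[np.argmax(llf)])
# For lognormal data both land near 0 (the log transform) and agree with each
# other up to the grid resolution.
```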

The value of \(\frac{y^\lambda - 1}{\lambda}\) at \(y = 1\) is \(0\) for any \(\lambda\), and its derivative with respect to \(y\) there is \(1\) for any \(\lambda\). Sometimes \(y\) is a version of some other variable scaled to give \(y = 1\) at some sort of average value.
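
Both facts can be checked directly:

\[
\left.\frac{y^\lambda - 1}{\lambda}\right|_{y = 1} = \frac{1 - 1}{\lambda} = 0,
\qquad
\left.\frac{d}{dy}\,\frac{y^\lambda - 1}{\lambda}\right|_{y = 1} = \left. y^{\lambda - 1}\right|_{y = 1} = 1.
\]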

Reference

This site is pretty good for a starter.