Activation functions are a fundamental component of machine learning (ML) and neural networks. By introducing nonlinearity, they enable a model to learn complicated relationships between inputs and outputs. The tanh (hyperbolic tangent) function is one of the most widely used activation functions because its properties support training deep neural networks. This article examines the tanh function in depth, covering its mathematical definition, its role in machine learning, how it compares with sigmoid and ReLU, and its strengths and weaknesses.
What is the Tanh Activation Function?
The tanh function is a sigmoid-shaped (S-shaped) function that maps input values to the range -1 to 1. Its mathematical definition is:
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Where:
- e^x is the exponential function (Euler's number e raised to the power of x).
- x is the input to the function.
This continuous, smooth function is important for gradient-based machine learning optimization approaches like backpropagation.
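As a minimal sketch (assuming NumPy, which the article does not name), the definition above can be implemented directly and checked against the library's built-in np.tanh:

```python
import numpy as np

def tanh_from_definition(x):
    # tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.linspace(-3, 3, 7)
print(tanh_from_definition(x))  # matches np.tanh(x) up to floating-point error
print(np.tanh(x))
```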
Key Properties of the Tanh Function
Range of Output: Tanh outputs values between -1 and 1. This is a fundamental difference from the sigmoid function, which outputs values between 0 and 1. By mapping inputs to both negative and positive values, the tanh function can represent a wider range of signals, which can help the network identify patterns in the data.
Symmetry: The tanh function is odd, so tanh(−x) = −tanh(x). This symmetry around the origin produces zero-centered outputs, which helps the network learn faster during the weight updates of backpropagation.
Differentiability: Like other activation functions, tanh is differentiable, so we can calculate its gradient. Machine learning relies on this property to optimize weights with gradient-based approaches like stochastic gradient descent.
Smooth gradient: The tanh function gradient is:
d/dx tanh(x) = 1 − tanh²(x)
Because this gradient is always positive and varies smoothly between 0 and 1, the model can learn stably.
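A short sketch, again assuming NumPy, computes this analytic gradient and compares it against a finite-difference estimate:

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2, always in (0, 1]
    return 1.0 - np.tanh(x) ** 2

x = np.array([-2.0, 0.0, 2.0])
eps = 1e-6
numerical = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)  # central difference
print(tanh_grad(x))   # analytic gradient
print(numerical)      # agrees with the analytic values to several decimal places
```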
Tanh Function in Neural Networks
In a neural network, layers of neurons apply weighted sums to their inputs. Tanh (or any activation function) introduces non-linearity into these layers. Without non-linearity, a neural network would behave like a linear model no matter how many layers it has, which would drastically limit its ability to learn complicated patterns.
Applying tanh to each neuron lets the network approximate complex, non-linear functions. Each neuron's output passes through a non-linear transformation before reaching the next layer, allowing the network to model complex input-output relationships.
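The sketch below illustrates this idea with a tiny two-layer forward pass in NumPy; the layer sizes and random weights are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 inputs, 8 hidden units, 1 output.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)  # tanh makes the hidden layer non-linear
    return h @ W2 + b2        # linear output layer

x = rng.normal(size=(5, 4))   # batch of 5 examples
print(forward(x).shape)       # (5, 1)
```

Without the np.tanh call, the two matrix multiplications would collapse into a single linear map, which is the point made above.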
Advantages of Tanh in Machine Learning
- Centering Around Zero: One of tanh's main advantages is that it outputs values between -1 and 1, unlike sigmoid, which outputs values between 0 and 1. This centers the mean of the outputs around zero, which makes training easier. Zero-centered activations reduce the risk of gradients exploding or vanishing, a problem that is more common with the sigmoid activation function (see the sketch after this list).
- Better Gradient Flow: The tanh function outputs both positive and negative values, which helps balance gradients during backpropagation. This more balanced gradient flow reduces some of the issues gradient descent has with vanishing gradients, making tanh a more stable choice for training deep neural networks.
- Efficient Learning: The tanh function makes neural network learning more effective due to its smooth and continuous nature. Tanh’s smooth gradient allows minor, consistent weight adjustments during training, stabilizing convergence.
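The zero-centering advantage from the first point above can be seen in a quick sketch (NumPy assumed; the zero-mean input distribution is an assumption for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Symmetric, zero-mean inputs (chosen only for illustration).
x = np.random.default_rng(1).normal(size=100_000)
print(np.tanh(x).mean())   # close to 0: tanh outputs are zero-centered
print(sigmoid(x).mean())   # close to 0.5: sigmoid outputs are shifted away from zero
```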
Disadvantages of Tanh Activation Function
The tanh function offers several benefits but also some drawbacks.
- Vanishing Gradient Problem: Despite its benefits, tanh suffers from the vanishing gradient problem. The function saturates when inputs are large (positive or negative), and its gradient becomes tiny there. As gradients travel backward through layers during backpropagation, they shrink further, causing the network to learn slowly or stop learning altogether.
For example, when the input x is far from zero, the gradient 1 − tanh²(x) becomes very small. This can cause the network to learn very slowly, especially in deep networks, as the updates to the weights become insignificant (see the sketch after this list).
- Computationally Expensive: Because it involves exponentials, the tanh function is more expensive to compute than activations built from simple multiplication, addition, and comparison. Despite modern hardware and optimization strategies, this can still matter in some real-time or low-resource applications.
- Not Well-suited for Sparse Data: Because the tanh function outputs values between -1 and 1 and saturates, it may not be optimal for sparse data, which often benefits from non-saturating activation functions like ReLU.
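The sketch below illustrates the vanishing gradient point above by printing the tanh gradient at a few inputs (NumPy assumed):

```python
import numpy as np

def tanh_grad(x):
    # Gradient of tanh: 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   gradient = {tanh_grad(x):.2e}")
# The gradient is 1 at x = 0 but drops to roughly 7e-02 at x = 2,
# 2e-04 at x = 5, and 8e-09 at x = 10.
```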
Sigmoid vs Tanh vs ReLU
Tanh is a prominent activation function, but machine learning and neural networks employ others. The sigmoid and ReLU functions stand out.
Sigmoid:
- Sigmoid maps input values to the range 0–1, making it well suited for probabilistic outputs such as binary classification.
- Like tanh, sigmoid suffers from the vanishing gradient problem, and its output is not zero-centered. This can slow training convergence.
ReLU (Rectified Linear Unit):
- ReLU is a popular deep learning activation function due to its computational simplicity (max(0, x)).
- ReLU does not suffer from the vanishing gradient problem the way tanh does, but it can suffer from the dying ReLU problem, where neurons "die" (i.e., become inactive and always output 0) if their inputs are consistently negative.
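A small sketch (NumPy assumed) compares the gradients of sigmoid, tanh, and ReLU at a few inputs, showing how the first two saturate while ReLU does not:

```python
import numpy as np

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

sig = 1.0 / (1.0 + np.exp(-x))
tanh_vals = np.tanh(x)
relu = np.maximum(0.0, x)

# Gradients: sigmoid and tanh both saturate for large |x|,
# while ReLU keeps a gradient of exactly 1 for every positive input.
print(sig * (1.0 - sig))       # sigmoid gradient: at most 0.25, tiny at |x| = 5
print(1.0 - tanh_vals ** 2)    # tanh gradient: up to 1 at x = 0, tiny at |x| = 5
print((x > 0).astype(float))   # ReLU gradient: 0 or 1
```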
Tanh vs Sigmoid

- Tanh produces values from -1 to 1, while sigmoid produces values from 0 to 1. Tanh is the better choice when outputs should be centered around zero, which helps avoid biased gradients.
- Tanh often learns faster than sigmoid because its gradient is stronger for inputs near 0 (the maximum gradient is 1 for tanh versus 0.25 for sigmoid).
Tanh and sigmoid are both nonlinear activation functions used in neural networks, but they have different output ranges and behaviors. The sigmoid function produces values between 0 and 1, making it well suited for tasks such as binary classification that require probabilities; however, it suffers from the vanishing gradient problem, particularly for extreme input values. In contrast, tanh produces values between -1 and 1, making it zero-centered, which aids faster convergence during training, and it saturates somewhat less severely than sigmoid.
Despite this benefit, both functions can experience vanishing gradients for large inputs; however, tanh is frequently chosen in hidden layers due to its superior gradient flow and zero-centered output, whilst sigmoid is commonly employed in output layers where probabilities or binary outcomes are required.
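As a hedged illustration of this convention, the sketch below builds a small binary classifier with tanh in the hidden layers and sigmoid at the output; PyTorch and the layer sizes are assumptions, since the article does not prescribe a framework or architecture:

```python
import torch
import torch.nn as nn

# Hypothetical binary classifier: tanh in the hidden layers,
# sigmoid at the output to produce a probability. Layer sizes are made up.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.Tanh(),
    nn.Linear(32, 32),
    nn.Tanh(),
    nn.Linear(32, 1),
    nn.Sigmoid(),
)

x = torch.randn(8, 16)   # batch of 8 examples with 16 features each
probs = model(x)         # values in (0, 1)
print(probs.shape)       # torch.Size([8, 1])
```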
Tanh vs ReLU
ReLU is often preferred over tanh for deep neural networks because it is less computationally expensive and suffers less from the vanishing gradient problem. However, tanh's zero-centered output can still be useful in some architectures.
Conclusion
In neural networks, the tanh function is valuable for producing zero-centered output. Its smooth gradient and its ability to map inputs to both negative and positive values make it suitable for many machine learning tasks. However, the vanishing gradient problem can hamper deep network training.
Tanh is a significant activation function in the history of neural networks and is still used in many machine learning applications, even though it may not always be the optimal choice. As deep learning evolves, understanding tanh's strengths and weaknesses can help in choosing the right activation function for a task.