Common Machine Learning Notation Standards
Data and Dimensions
- m - Number of training examples
- n - Number of features/dimensions
- X - Input matrix/feature matrix (usually ∈ ℝ^(m×n))
- x - A single input vector or a single feature value
- x^(i) - The i-th training example
- x_j - The j-th feature
- x_j^(i) - The j-th feature of the i-th example
- y - Output/target/label vector
- y^(i) - Label for the i-th training example
- ŷ (y-hat) - Predicted output
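For concreteness, a minimal NumPy sketch of these shape conventions (rows are examples, columns are features; the variable names are illustrative, not standard):

```python
import numpy as np

m, n = 100, 5                 # m training examples, n features
X = np.random.randn(m, n)     # X ∈ ℝ^(m×n): one example per row
y = np.random.randn(m)        # y: one label per example
x_1   = X[0]                  # x^(1): the first training example (0-indexed in code)
x_1_3 = X[0, 2]               # x_3^(1): the 3rd feature of the first example
```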
Model Parameters
- θ (theta) - Parameter vector (common in classical ML)
- w or W - Weight matrix or vector
- b - Bias term
- β (beta) - Regression coefficients (statistics convention)
- α (alpha) - Learning rate
- λ (lambda) - Regularization parameter
- ε (epsilon) - Small value for numerical stability or error threshold
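As one illustration of how these symbols typically combine (a common convention; the exact scaling factors vary by text), here is an L2-regularized least-squares cost and its gradient-descent update using θ, α, and λ:

```latex
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
          + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2,
\qquad
\theta_j \leftarrow \theta_j - \alpha\,\frac{\partial J(\theta)}{\partial \theta_j}
```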
Functions and Operations
- h(x) or h_θ(x) - Hypothesis function
- f(x) - General function or true underlying function
- g(x) - Often an activation function
- J(θ) - Cost/objective function
- L - Loss function (for a single example)
- ℓ - Alternative loss notation
- ∇ (nabla) - Gradient operator
- σ (sigma) - Sigmoid function or standard deviation
- Σ - Summation
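A minimal logistic-regression sketch tying several of these together: g is the sigmoid σ, h_θ(x) = σ(θᵀx) is the hypothesis, J(θ) is the averaged cross-entropy cost, and the returned gradient corresponds to ∇J(θ). Function names here are illustrative, not standard library calls:

```python
import numpy as np

def sigmoid(z):
    # g(z) = σ(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    # J(θ) and ∇J(θ) for logistic regression over m examples
    m = X.shape[0]
    h = sigmoid(X @ theta)                 # h_θ(x^(i)) for every example at once
    J = -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    grad = (1.0 / m) * (X.T @ (h - y))     # ∇J(θ), one component per θ_j
    return J, grad
```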
Neural Network Specific
- L - Number of layers
- n^[l] - Number of units in layer l
- W^[l] - Weight matrix for layer l
- b^[l] - Bias vector for layer l
- a^[l] - Activations for layer l
- z^[l] - Pre-activation values for layer l
- δ (delta) - Error term in backpropagation
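A hedged sketch of one forward-propagation step in this layer notation (shapes follow the "units × previous units" convention; g is whatever activation the layer uses, and the function name is illustrative):

```python
import numpy as np

def forward_layer(a_prev, W, b, g=np.tanh):
    # z^[l] = W^[l] a^[l-1] + b^[l],   a^[l] = g(z^[l])
    # W has shape (n^[l], n^[l-1]); b, z, and a have shape (n^[l],)
    z = W @ a_prev + b
    a = g(z)
    return z, a
```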
Probability and Statistics
- P(x) - Probability of x
- p(x) - Probability density/mass function
- 𝔼[X] - Expected value of X
- μ (mu) - Mean
- σ² - Variance
- Σ - Covariance matrix (context-dependent)
- 𝒩(μ, σ²) - Normal distribution
- ~ - “Distributed as” (e.g., X ~ 𝒩(0,1))
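For example, the univariate normal distribution ties several of these together:

```latex
x \sim \mathcal{N}(\mu, \sigma^2), \qquad
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad
\mathbb{E}[x] = \mu, \quad \operatorname{Var}(x) = \sigma^2
```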
Optimization
- t - Time step or iteration number
- η (eta) - Learning rate (alternative to α)
- ρ (rho) - Momentum coefficient or discount factor
- γ (gamma) - Discount factor (reinforcement learning)
- ∂ - Partial derivative
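One common (though not the only) formulation in which these symbols appear together is gradient descent with momentum at iteration t:

```latex
v_t = \rho\, v_{t-1} + \nabla_\theta J(\theta_{t-1}), \qquad
\theta_t = \theta_{t-1} - \eta\, v_t
```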
Evaluation and Splits
- X_train, X_test, X_val - Training, test, validation sets
- k - Number of clusters (k-means) or folds (k-fold CV)
- K - Number of classes in classification
- TP, FP, TN, FN - True/False Positives/Negatives
- ACC - Accuracy
- MSE - Mean Squared Error
- RMSE - Root Mean Squared Error
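A small sketch of the most common scalar metrics in this list (assuming binary confusion-matrix counts for accuracy and NumPy arrays of targets/predictions for the error metrics; function names are illustrative):

```python
import numpy as np

def accuracy(tp, fp, tn, fn):
    # ACC = (TP + TN) / (TP + FP + TN + FN)
    return (tp + tn) / (tp + fp + tn + fn)

def mse(y_true, y_pred):
    # MSE = (1/m) Σ (y^(i) - ŷ^(i))^2
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    # RMSE = sqrt(MSE)
    return np.sqrt(mse(y_true, y_pred))
```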
Matrix Operations
- X^T - Transpose of X
- X^(-1) - Inverse of X
- ⊙ - Element-wise multiplication (Hadamard product)
- · or ⟨·,·⟩ - Dot product
- ||x|| - Norm of x (L2 norm unless otherwise specified)
- I - Identity matrix
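These map directly onto NumPy operations; a quick reference sketch (the values are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
x = np.array([1.0, 2.0])

A_T   = A.T                  # X^T: transpose
A_inv = np.linalg.inv(A)     # X^(-1): inverse (square, non-singular matrices only)
H     = A * B                # ⊙: element-wise (Hadamard) product
dot   = x @ x                # ⟨x, x⟩: dot product
nrm   = np.linalg.norm(x)    # ||x||: L2 norm by default
I     = np.eye(2)            # I: identity matrix
```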
Special Notation
- 1{condition} - Indicator function (1 if true, 0 if false)
- argmax - Argument that maximizes
- argmin - Argument that minimizes
- log - Natural logarithm (unless a base is specified)
- exp - Exponential function
- 𝟙 - Vector of ones
- 𝟎 - Vector of zeros
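A few of these in NumPy form, for reference (variable names are illustrative):

```python
import numpy as np

scores = np.array([0.1, 0.7, 0.2])

indicator = float(scores[1] > 0.5)   # 1{condition}: 1 if the condition holds, else 0
k_hat     = np.argmax(scores)        # argmax: index of the largest entry
log_val   = np.log(2.0)              # log: natural logarithm (base e) in NumPy
ones_vec  = np.ones(3)               # 𝟙: vector of ones
zeros_vec = np.zeros(3)              # 𝟎: vector of zeros
```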
Indexing Conventions
- i - Typically indexes training examples (1 to m)
- j - Typically indexes features (1 to n)
- k - Typically indexes output classes
- l - Typically indexes layers in neural networks
- t - Typically indexes time steps
Set Notation
- 𝒟 - Dataset
- ℝ - Real numbers
- ℝ^n - n-dimensional real space
- ∈ - “Element of”
- ⊂ - “Subset of”
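Putting several of these together, a supervised dataset of m examples with n real-valued features is often written as:

```latex
\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}, \qquad x^{(i)} \in \mathbb{R}^n
```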