Dataset Generators
The jasmine.datasets module provides synthetic data generators for testing and examples.
Functions
| generate_regression | Generate a random regression problem with JAX. |
| generate_classification | Generate a random n-class classification problem. |
| generate_polynomial | Generate a polynomial regression problem with one feature. |
generate_regression
- jasmine.datasets.generate_regression(n_samples=100, n_features=20, n_informative=10, noise=0.0, bias=0.0, shuffle=True, coef=False, random_state=None)
Generate a random regression problem with JAX.
This function creates a dataset where the output is a linear combination of a subset of the input features, with optional Gaussian noise.
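The linear model described above can be sketched in a few lines of JAX. This is a minimal illustration, not jasmine's implementation: the helper name `sketch_linear_data`, the standard-normal features and coefficients, and the fixed (unshuffled) column layout are all assumptions made for the example.

```python
import jax
import jax.numpy as jnp


def sketch_linear_data(n_samples=100, n_features=20, n_informative=10,
                       noise=0.0, bias=0.0, seed=0):
    """Hypothetical stand-in for the model described above (not jasmine's code)."""
    key_x, key_w, key_eps = jax.random.split(jax.random.PRNGKey(seed), 3)

    # Assumption: features are standard normal, shape (n_samples, n_features).
    X = jax.random.normal(key_x, (n_samples, n_features))

    # Only the first n_informative coefficients are non-zero; the rest stay at 0.
    w = jnp.zeros(n_features).at[:n_informative].set(
        jax.random.normal(key_w, (n_informative,))
    )

    # Linear combination, plus intercept, plus Gaussian noise with std `noise`.
    y = X @ w + bias + noise * jax.random.normal(key_eps, (n_samples,))
    return X, y, w


X, y, w = sketch_linear_data(n_samples=200, n_features=8, n_informative=3, bias=1.5)
print(X.shape, y.shape)  # (200, 8) (200,)
```

With `noise=0.0` the target is an exact linear function of the informative columns, which makes the sketch handy for checking that an estimator can recover `w`.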
- Parameters:
n_samples (int) – The number of samples to generate.
n_features (int) – The total number of features.
n_informative (int) – The number of features that are actually used to generate the output. The rest are noise.
noise (float) – The standard deviation of the Gaussian noise added to the output.
bias (float) – The bias term (intercept) in the underlying linear model.
shuffle (bool) – Whether to shuffle the features and informative indices. If False, the informative features will always be the first n_informative columns.
coef (bool) – If True, the ground truth coefficients and bias are returned.
random_state (int, optional) – Seed for the random number generator for reproducibility. If None, a random seed is used.
- Returns:
- By default, returns (X, y).
If coef is True, returns (X, y, ground_truth_coefficients).
- Return type:
generate_classification
- jasmine.datasets.generate_classification(n_samples: int = 100, n_features: int = 20, n_informative: int = 5, n_redundant: int = 2, n_classes: int = 2, class_sep: float = 1.0, feature_noise: float = 1.0, redundant_noise: float = 0.0, shuffle: bool = True, random_state: int | None = None) -> Tuple[Array, Array]
Generate a random n-class classification problem.
This function creates clusters of points normally distributed around vertices of a hypercube, making it suitable for testing classification algorithms.
- Parameters:
n_samples – The number of samples.
n_features – The total number of features.
n_informative – The number of informative features.
n_redundant – The number of redundant features (linear combinations of informative features).
n_classes – The number of classes (or labels).
class_sep – Factor multiplying the hypercube size. Larger values spread out the classes and make the problem easier.
shuffle – Whether to shuffle the features.
random_state – Seed for the random number generator.
- Returns:
A tuple (X, y) where X is the feature matrix and y are the integer labels.
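The hypercube construction and the role of class_sep can be illustrated with a small JAX sketch. It is a simplified assumption, not the library's code: the helper `sketch_hypercube_clusters` is hypothetical, it places exactly one unit-variance cluster per class, and it omits redundant features and shuffling.

```python
import jax
import jax.numpy as jnp


def sketch_hypercube_clusters(n_samples=100, n_informative=5, n_classes=2,
                              class_sep=1.0, seed=0):
    """Hypothetical illustration: one Gaussian cluster per class at a hypercube vertex."""
    if n_classes > 2 ** n_informative:
        raise ValueError("need n_classes <= 2 ** n_informative distinct vertices")
    key_y, key_x = jax.random.split(jax.random.PRNGKey(seed))

    # Vertex k is the binary expansion of k, mapped from {0, 1} to {-class_sep, +class_sep}.
    bits = (jnp.arange(n_classes)[:, None] >> jnp.arange(n_informative)[None, :]) & 1
    centroids = class_sep * (2.0 * bits - 1.0)

    # Draw a label per sample, then scatter unit-variance noise around its centroid.
    y = jax.random.randint(key_y, (n_samples,), 0, n_classes)
    X = centroids[y] + jax.random.normal(key_x, (n_samples, n_informative))
    return X, y, centroids


X, y, centroids = sketch_hypercube_clusters(
    n_samples=300, n_informative=3, n_classes=4, class_sep=2.0
)
print(X.shape, y.shape)  # (300, 3) (300,)
print(centroids)         # every coordinate is -2.0 or +2.0
```

In this sketch neighbouring centroids are 2 * class_sep apart while the cluster spread stays fixed, which is why larger values make the classes easier to separate.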
generate_polynomial
- jasmine.datasets.generate_polynomial(n_samples: int = 100, degree: int = 2, noise: float = 0.0, bias: float = 0.0, coef: bool = False, random_state: int | None = None)
Generate a polynomial regression problem with one feature.
- Parameters:
n_samples – The number of samples.
degree – The degree of the polynomial relationship.
noise – The standard deviation of the Gaussian noise.
bias – The bias term (intercept).
coef – If True, the ground truth coefficients and bias are returned.
random_state – Seed for the random number generator.
- Returns:
By default, returns (X, y). X will have a shape of (n_samples, 1). If coef is True, returns (X, y, ground_truth_coefficients, bias).
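A one-feature polynomial target of this kind can be sketched as follows. The helper `sketch_polynomial_data` is hypothetical, and the uniform sampling range for X and the normal draw for the coefficients are assumptions; only the output shapes and the four-element return mirror the description above.

```python
import jax
import jax.numpy as jnp


def sketch_polynomial_data(n_samples=100, degree=2, noise=0.0, bias=0.0, seed=0):
    """Hypothetical sketch of a single-feature polynomial regression problem."""
    key_x, key_c, key_eps = jax.random.split(jax.random.PRNGKey(seed), 3)

    # Single feature column, shape (n_samples, 1). Assumption: uniform on [-1, 1].
    X = jax.random.uniform(key_x, (n_samples, 1), minval=-1.0, maxval=1.0)

    # One coefficient per power x**1 ... x**degree; the intercept is kept separately.
    coefs = jax.random.normal(key_c, (degree,))
    powers = jnp.concatenate([X ** k for k in range(1, degree + 1)], axis=1)

    # Polynomial in x, plus intercept, plus Gaussian noise with std `noise`.
    y = powers @ coefs + bias + noise * jax.random.normal(key_eps, (n_samples,))
    return X, y, coefs, bias


X, y, coefs, bias = sketch_polynomial_data(n_samples=50, degree=3, bias=0.5)
print(X.shape, y.shape, coefs.shape)  # (50, 1) (50,) (3,)
```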
Examples
Regression Data
from jasmine.datasets import generate_regression
import jax.numpy as jnp

# Basic regression dataset
X, y = generate_regression(
    n_samples=1000,
    n_features=20,
    n_informative=5,
    noise=0.1,
    random_state=42,
)
print(f"Data shape: {X.shape}")
print(f"Target range: [{jnp.min(y):.2f}, {jnp.max(y):.2f}]")

# With ground truth coefficients
X, y, coef = generate_regression(
    n_samples=500,
    n_features=10,
    coef=True,
    random_state=42,
)
print(f"True coefficients: {coef}")
Classification Data
from jasmine.datasets import generate_classification
import jax.numpy as jnp  # needed for jnp.bincount below

# Binary classification
X, y = generate_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_classes=2,
    class_sep=1.0,
    random_state=42,
)
print(f"Data shape: {X.shape}")
print(f"Class distribution: {jnp.bincount(y)}")

# Multi-class classification
X, y = generate_classification(
    n_samples=1500,
    n_features=15,
    n_classes=3,
    n_informative=8,
    random_state=42,
)
Polynomial Data
from jasmine.datasets import generate_polynomial

# Polynomial regression dataset
X, y = generate_polynomial(
    n_samples=500,
    degree=3,
    noise=0.1,
    random_state=42,
)
print(f"Polynomial data shape: {X.shape}")
Usage Tips
- Use random_state for reproducible datasets
- Adjust the noise parameter to control data difficulty
- n_informative controls how many features actually affect the target
- class_sep in classification controls how well-separated the classes are
- All generators return JAX arrays for seamless integration
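The reproducibility tip rests on how seeded random number generation works in JAX. The sketch below assumes an integer seed is converted into a PRNG key with `jax.random.PRNGKey`; whether jasmine handles random_state exactly this way is not confirmed here, and the helper `draw` is hypothetical.

```python
import jax
import jax.numpy as jnp


def draw(seed, shape=(4, 3)):
    # Assumption: an integer seed is turned into a JAX PRNG key before sampling.
    return jax.random.normal(jax.random.PRNGKey(seed), shape)


a, b, c = draw(42), draw(42), draw(7)
print(bool(jnp.array_equal(a, b)))  # True: the same seed reproduces the same data
print(bool(jnp.array_equal(a, c)))  # a different seed gives a different draw
```

Fixing the seed in tests and examples keeps results comparable across runs, while varying it gives independent datasets from the same generator settings.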