peakweather.dataset
- class PeakWeatherDataset(root: str | None = None, pad_missing_variables: bool = True, years: int | Sequence[int] | None = None, parameters: str | Sequence[str] | None = None, extended_topo_vars: str | Sequence[str] | None = 'none', extended_nwp_vars: str | Sequence[str] | None = 'none', imputation_method: Literal['locf', 'zero', None] = 'zero', interpolation_method: str = 'nearest', freq: str | None = None, compute_uv: bool = True, station_type: Literal['rain_gauge', 'meteo_station'] | None = None, aggregation_methods: dict[str, str] | None = None)
Bases: object
PeakWeather is a high-quality meteorological dataset derived from SwissMetNet, the automated measurement network operated by MeteoSwiss. It offers a robust resource for research and applications in spatiotemporal modeling.
PeakWeather includes high-frequency meteorological observations recorded every 10 minutes, collected from 302 ground stations distributed across Switzerland, covering the period from January 1, 2017 to March 31, 2025. The dataset also provides high-resolution topographic features at 50-meter resolution and ensemble forecasts from the ICON-CH1-EPS operational numerical weather prediction (NWP) model. The dataset is described in more detail in “PeakWeather: MeteoSwiss Weather Station Measurements for Spatiotemporal Deep Learning” (Zambon et al., 2025).
This class loads and reads the PeakWeather dataset, providing utilities for accessing, preprocessing, and integrating the data into machine learning workflows.
- Dataset size:
Time steps: 433728
Stations: 302
Channels: 8
Sampling interval: 10 minutes
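The number of time steps is consistent with one sample every 10 minutes over the stated period. A quick check, assuming the first step falls at 2017-01-01 00:00 and the last at 2025-03-31 23:50 (the exact boundary timestamps are an assumption, not stated above):

```python
from datetime import datetime, timedelta

# Assumption: the observation range is half-open, [2017-01-01 00:00, 2025-04-01 00:00).
start = datetime(2017, 1, 1)
end_exclusive = datetime(2025, 4, 1)
step = timedelta(minutes=10)

# Dividing two timedeltas with // yields an integer count of whole steps.
n_steps = (end_exclusive - start) // step
print(n_steps)  # 433728
```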
- Channels:
wind_direction: Wind direction (degree). Ten minutes mean.
wind_speed: Wind speed scalar (meter/second). Ten minutes mean.
wind_gust: Gust peak (meter/second). Maximum recorded over ten minutes.
pressure: Atmospheric pressure at barometric altitude (QFE) (hectopascal). Instant value.
precipitation: Precipitation (millimeter). Ten minutes total.
sunshine: Sunshine duration (minute). Ten minutes total.
temperature: Air temperature 2 m above ground (degree Celsius). Instant value.
humidity: Relative air humidity 2 m above ground (per cent). Instant value.
- Static attributes:
stations_table: Information associated with the stations, including name, type, latitude, longitude, height, and topographical descriptors.
installation_table: Information about stations’ installation.
parameters_table: Description of the quantities measured.
- Parameters:
root (str, optional) – The root directory where the dataset is stored. If None, the dataset is stored in the current working directory. (default: None)
pad_missing_variables (bool, optional) – If True, pad missing variables with NaN values. (default: True)
years (int or list of int, optional) – The years to include in the dataset. If None, all available years are included. (default: None)
extended_topo_vars (str or list of str, optional) – The topography variables to include in the dataset. If None, no topography variables are included. (default: "none")
extended_nwp_vars (str or list of str, optional) – The NWP (ICON-CH1-EPS) variables to include in the dataset. If None, no NWP variables are included. (default: "none")
imputation_method (str, optional) – The method to use for imputing missing values. Options are “locf” (last observation carried forward), “zero” (fill with zero), or None (no imputation). (default: "zero")
interpolation_method (str, optional) – The method to use for interpolating topography variables. Options are “linear”, “nearest”, “quadratic”, “cubic”, “barycentric”, “krogh”, “akima”, or “makima”. (default: "nearest")
freq (str, optional) – The frequency to resample the dataset to. If None, no resampling is applied. (default: None)
compute_uv (bool) – Whether the u-v components of the wind should be computed and included in the dataset. (default: True)
station_type (str, optional) – The type of stations to consider, either meteorological stations or rain gauges. If not defined, all stations are included. (default: None)
aggregation_methods (dict, optional) – If given, applies a different aggregation than the default one to the specified parameters. The dictionary must map each parameter name to one of "mean", "max", "sum", "last". (default: None)
- available_icon = {'ew_wind', 'humidity', 'nw_wind', 'precipitation', 'pressure', 'sunshine', 'temperature', 'wind_gust'}
- available_parameters = {'humidity': 'ure200s0', 'precipitation': 'rre150z0', 'pressure': 'prestas0', 'sunshine': 'sre000z0', 'temperature': 'tre200s0', 'wind_direction': 'dkl010z0', 'wind_gust': 'fkl010z1', 'wind_speed': 'fkl010z0'}
- available_topography = {'ASPECT_10000M_SIGRATIO1', 'ASPECT_2000M_SIGRATIO1', 'DEM', 'SLOPE_10000M_SIGRATIO1', 'SLOPE_2000M_SIGRATIO1', 'SN_DERIVATIVE_10000M_SIGRATIO1', 'SN_DERIVATIVE_2000M_SIGRATIO1', 'STD_10000M', 'STD_2000M', 'TPI_10000M', 'TPI_2000M', 'WE_DERIVATIVE_10000M_SIGRATIO1', 'WE_DERIVATIVE_2000M_SIGRATIO1'}
- available_years = {2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025}
- base_url = 'https://huggingface.co/datasets/MeteoSwiss/PeakWeather/resolve/0b440af7b855c68288efeb0fffb27f5ab2cc56fa/data/'
- get_icon_data(icon_variable: str) → xr.Dataset
Returns an Xarray dataset with the given variable.
- Parameters:
icon_variable (str) – The ICON variable. Must be one of self.available_icon.
- Raises:
ValueError – If the corresponding zarr is not available.
- Returns:
The dataset with the ICON forecasts.
- Return type:
xr.Dataset
- get_observations(stations: str | List[str] | None = None, parameters: str | List[str] | None = None, first_date: str | Timestamp | None = None, last_date: str | Timestamp | None = None, as_numpy: bool = False, return_mask: bool = False, copy: bool = True) → DataFrame | ndarray | tuple[DataFrame | ndarray, DataFrame | ndarray]
Get observations for the specified stations and parameters.
The observations are returned as a pandas DataFrame or numpy array, depending on the value of as_numpy. If return_mask is set to True, a tuple of (observations, mask) is returned.
The observations are filtered based on the specified stations, parameters, and date range. If no filtering is applied, all observations are returned. The date range is inclusive of the start date and exclusive of the end date.
- Parameters:
stations (str or list, optional) – Station IDs to filter. If None, all stations are used. (default: None)
parameters (str or list, optional) – Parameter IDs to filter. If None, all parameters are used. (default: None)
first_date (str or pd.Timestamp, optional) – Start date for filtering. If None, no temporal filtering is applied. (default: None)
last_date (str or pd.Timestamp, optional) – End date for filtering. If None, no temporal filtering is applied. (default: None)
as_numpy (bool, optional) – If True, return the observations as an ndarray instead of a DataFrame. (default: False)
return_mask (bool, optional) – If True, return the mask as well. (default: False)
copy (bool, optional) – If True, return a copy of the data. (default: True)
- Returns:
- Return type:
FrameArray or tuple
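The half-open date range described above (start included, end excluded) can be sketched in plain Python as follows. The `select` helper and the sample timestamps are hypothetical; they only mirror the documented filtering rule and are not the library's implementation.

```python
from datetime import datetime, timedelta

# Six hypothetical 10-minute timestamps: 00:00, 00:10, ..., 00:50.
timestamps = [datetime(2020, 1, 1) + timedelta(minutes=10 * i) for i in range(6)]

def select(timestamps, first_date=None, last_date=None):
    """Keep timestamps t with first_date <= t < last_date (None disables a bound)."""
    return [
        t for t in timestamps
        if (first_date is None or t >= first_date)
        and (last_date is None or t < last_date)
    ]

picked = select(
    timestamps,
    first_date=datetime(2020, 1, 1, 0, 10),
    last_date=datetime(2020, 1, 1, 0, 40),
)
print([t.strftime("%H:%M") for t in picked])  # ['00:10', '00:20', '00:30']
```

With both bounds left as None, every timestamp is kept, which matches the "no temporal filtering" case.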
- static get_uv_wind(wind_speed: ndarray, wind_direction: ndarray, direction_unit: Literal['deg', 'rad'] = 'deg') → Tuple[ndarray, ndarray]
Computes the u,v components of the wind given wind speed and direction. The u component is the eastward component while v is the northward component.
- Parameters:
wind_speed (np.ndarray) – The wind speed.
wind_direction (np.ndarray) – The wind direction, increasing clockwise where a northerly wind has 0 degrees.
direction_unit (Literal["deg", "rad"], optional) – The angle unit of measure. Defaults to “deg”.
- Returns:
A tuple (u, v) containing the eastward and northward wind components.
- Return type:
Tuple[np.ndarray, np.ndarray]
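A minimal sketch of this conversion, assuming the usual meteorological convention in which the direction is where the wind blows from, so that a northerly wind (0 degrees) has a negative northward component. The function name `uv_from_speed_direction` is hypothetical and works on scalars for brevity; the actual method takes arrays and its implementation may differ.

```python
import math

def uv_from_speed_direction(speed, direction, unit="deg"):
    """Return (u, v): eastward and northward components of the wind vector.

    Assumes `direction` is the bearing the wind comes FROM, measured
    clockwise from north (meteorological convention).
    """
    if unit not in ("deg", "rad"):
        raise ValueError(f"unknown angle unit: {unit!r}")
    theta = math.radians(direction) if unit == "deg" else direction
    u = -speed * math.sin(theta)
    v = -speed * math.cos(theta)
    return u, v

# A 10 m/s northerly wind blows towards the south: u is about 0, v is about -10.
u_north, v_north = uv_from_speed_direction(10.0, 0.0)
# A 10 m/s westerly wind (270 degrees) blows towards the east: u is about 10, v is about 0.
u_west, v_west = uv_from_speed_direction(10.0, 270.0)
```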
- static get_wind_direction(u: ndarray, v: ndarray) → ndarray
Given the u and v components, get the wind direction.
- Parameters:
u (np.ndarray) – The eastward wind component.
v (np.ndarray) – The northward wind component.
- Returns:
The wind direction.
- Return type:
np.ndarray
- static get_wind_speed(u: ndarray, v: ndarray) → ndarray
Given the u and v components, get the wind speed.
- Parameters:
u (np.ndarray) – The eastward wind component.
v (np.ndarray) – The northward wind component.
- Returns:
The wind speed.
- Return type:
np.ndarray
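The inverse conversions behind get_wind_direction and get_wind_speed can be sketched in the same way. This assumes the meteorological "blowing from" convention with angles in degrees in [0, 360); the scalar helpers `wind_speed` and `wind_direction` are hypothetical stand-ins, not the library's code.

```python
import math

def wind_speed(u, v):
    """Magnitude of the horizontal wind vector."""
    return math.hypot(u, v)

def wind_direction(u, v):
    """Bearing (degrees, clockwise from north) the wind is blowing FROM."""
    # Negating both components flips the vector so it points to the origin
    # of the wind; atan2(east, north) then gives a clockwise-from-north angle.
    return math.degrees(math.atan2(-u, -v)) % 360.0

speed = wind_speed(3.0, 4.0)             # 5.0
from_west = wind_direction(10.0, 0.0)    # 270.0 (westerly wind, blowing eastward)
from_south = wind_direction(0.0, 10.0)   # 180.0 (southerly wind, blowing northward)
```

For calm conditions (u = v = 0) the direction is undefined; this sketch returns whatever `atan2` yields for zeros, and how the library handles that case is not stated here.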
- get_windows(window_size: int, horizon_size: int, stations: str | List[str] | None = None, parameters: str | List[str] | None = None, first_date: str | Timestamp | None = None, last_date: str | Timestamp | None = None) → Windows
Get sliding windows of observations and mask. The input data is reshaped into sliding windows of size (window_size, num_stations, num_channels) and the target data is reshaped into sliding windows of size (horizon_size, num_stations, num_channels).
- Parameters:
window_size (int) – Size of the input window.
horizon_size (int) – Size of the output horizon.
stations (str or list, optional) – Station IDs to filter. If None, all stations are used. (default: None)
parameters (str or list, optional) – Parameter IDs to filter. If None, all parameters are used. (default: None)
first_date (str or pd.Timestamp, optional) – Start date for filtering. If None, no temporal filtering is applied. (default: None)
last_date (str or pd.Timestamp, optional) – End date for filtering. If None, no temporal filtering is applied. (default: None)
- Returns:
- A tuple containing:
x (np.ndarray): Sliding windows of observations.
mask_x (np.ndarray): Sliding windows of mask.
y (np.ndarray): Sliding windows of target values.
mask_y (np.ndarray): Sliding windows of target mask.
- Return type:
Windows
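The windowing idea can be illustrated on a one-dimensional sequence. This sketch assumes a stride of one step and a horizon that starts immediately after the input window; neither detail is stated above, and the helper `sliding_windows` is hypothetical. In the real dataset each element would be a (num_stations, num_channels) slice rather than a scalar.

```python
def sliding_windows(series, window_size, horizon_size):
    """Split a time-ordered sequence into (input window, target horizon) pairs."""
    # Each sample needs window_size + horizon_size consecutive steps.
    n_samples = len(series) - window_size - horizon_size + 1
    pairs = []
    for start in range(max(n_samples, 0)):
        split = start + window_size
        pairs.append((series[start:split], series[split:split + horizon_size]))
    return pairs

series = list(range(6))  # six time steps: 0, 1, ..., 5
pairs = sliding_windows(series, window_size=3, horizon_size=2)
for x, y in pairs:
    print(x, "->", y)
# [0, 1, 2] -> [3, 4]
# [1, 2, 3] -> [4, 5]
```

Applying the same slicing to the mask gives the mask_x and mask_y counterparts.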
- had_values_before(cutoff_time: str | Timestamp) → Series
Returns a binary mask indicating whether each station measured each variable before the given cutoff time. This information is particularly important when the task at hand relies on an inductive or transductive learning procedure.
- Parameters:
cutoff_time (str or pd.Timestamp) – The timestamp (UTC) representing the cutoff time.
- Returns:
A binary series with a multi-index (station, variable).
- Return type:
pd.Series
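As a rough illustration of what such a mask encodes, the sketch below uses a plain dictionary keyed by (station, variable). The station names, the first-observation times, and the strict "before" comparison are all assumptions made for the example; the actual method returns a pandas Series.

```python
from datetime import datetime

# Hypothetical first valid observation per (station, variable); None means
# the station never reported that variable.
first_valid = {
    ("STA_A", "temperature"): datetime(2017, 1, 1),
    ("STA_A", "wind_speed"): datetime(2021, 6, 1),
    ("STA_B", "temperature"): None,
}

def had_values_before(first_valid, cutoff_time):
    """True where a (station, variable) pair has data strictly before the cutoff."""
    return {
        key: first is not None and first < cutoff_time
        for key, first in first_valid.items()
    }

mask = had_values_before(first_valid, datetime(2020, 1, 1))
# Pairs that were already observed before the cutoff.
seen = [key for key, flag in mask.items() if flag]
print(seen)  # [('STA_A', 'temperature')]
```

Pairs flagged False would be unseen before the cutoff, which is the distinction an inductive evaluation split relies on.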
- load(aggregation_methods: dict[str, str] | None = None)
Load the dataset.
This method downloads the dataset if it is not already present and loads the data into memory. The data is returned as a tuple containing the observations, mask, and static tables.
The observations are resampled to the specified frequency and missing values are imputed using the specified method.
The topography data is interpolated to the station locations using the specified interpolation method.
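The imputation options named in the class parameters ("locf", "zero", or None) can be sketched on a single series as follows. This is an illustrative helper, not the library's implementation; in particular, leaving leading gaps missing under "locf" is an assumption.

```python
import math

def impute(values, method="zero"):
    """Fill NaN entries of a sequence using 'locf', 'zero', or None (no-op)."""
    if method not in ("locf", "zero", None):
        raise ValueError(f"unknown imputation method: {method!r}")
    if method is None:
        return list(values)
    filled = []
    last_seen = math.nan  # assumption: gaps before the first value stay NaN under "locf"
    for value in values:
        if math.isnan(value):
            value = 0.0 if method == "zero" else last_seen
        else:
            last_seen = value
        filled.append(value)
    return filled

series = [math.nan, 1.5, math.nan, math.nan, 2.0]
zero_filled = impute(series, "zero")  # [0.0, 1.5, 0.0, 0.0, 2.0]
locf_filled = impute(series, "locf")  # [nan, 1.5, 1.5, 1.5, 2.0]
```

Zero filling is only meaningful together with the mask, since a filled 0.0 is indistinguishable from a measured one.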
- load_raw(aggregation_methods: dict[str, str] | None = None)
Load the raw dataset.
This method downloads the dataset if it is not already present and loads the data into memory. The data is returned as a tuple containing the observations, static tables, and optional topography data.
- load_topography() → dict
Load the topography data.
This method downloads the topography data if it is not already present and loads the data into memory. The data is returned as a dictionary containing the topography data for each variable.
- property missing_values: Series
Missing values for each parameter, considering stations equipped with the necessary sensor.
- property required_file_names: Mapping[str, str]
The relative filepaths that must be present in order to skip downloading.
- property required_files_paths: Mapping[str, str]
The absolute filepaths that must be present in order to skip downloading.
- resample(df_observations: DataFrame, df_parameters: DataFrame) → DataFrame
Resample the observations to the specified frequency.
This method resamples the observations to the specified frequency and returns the resampled DataFrame.
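To make the effect of the four aggregation names ("mean", "max", "sum", "last") concrete, the sketch below downsamples 10-minute values to hourly ones by grouping six consecutive steps. The `aggregate` helper, the sample values, and the choice of aggregation per channel are illustrative assumptions; the library's defaults per parameter are not listed here.

```python
from statistics import mean

AGGREGATORS = {
    "mean": mean,
    "max": max,
    "sum": sum,
    "last": lambda block: block[-1],
}

def aggregate(values, factor, method):
    """Reduce each block of `factor` consecutive values with the named method."""
    if method not in AGGREGATORS:
        raise ValueError(f"unknown aggregation method: {method!r}")
    reducer = AGGREGATORS[method]
    return [reducer(values[i:i + factor]) for i in range(0, len(values), factor)]

# Two hours of hypothetical 10-minute samples (6 steps per hour).
temperature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
precipitation = [0.0, 0.5, 0.0, 0.25, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.5]

hourly_temperature = aggregate(temperature, 6, "last")      # [6.0, 1.0]
hourly_precipitation = aggregate(precipitation, 6, "sum")   # [0.75, 1.5]
```

Totals such as precipitation are naturally summed, while instantaneous values such as temperature might be taken as the last or mean value of each block.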
- show_parameters_description()
Show description of the parameters in the dataset.