IRIS Web Services Data Quality Metrics Exercise#

:auth: Nate Stevens (Pacific Northwest Seismic Network)
:email: ntsteven@uw.edu
:org: Pacific Northwest Seismic Network
:license: GPLv3
:purpose:
In this notebook we’ll query data quality metrics from the MUSTANG measurements web service
and the FDSNWS availability web service provided by EarthScope/SAGE to get a sense of data
availability and usefulness BEFORE downloading a ton of data!

What is MUSTANG? - A continually growing dataset of data quality statistics
for every seismic station archived at the Data Management Center!

What does MUSTANG stand for? - The Modular Utility for STAtistical kNowledge Gathering system

Where can I go to learn more about MUSTANG? https://service.iris.edu/mustang/

Dependencies for this Notebook:

  • ObsPy

  • ws_client (ws_client.py)

## IMPORT MODULES
import os
from pathlib import Path
import pandas as pd
from obspy import UTCDateTime
from obspy.clients.fdsn import Client
# Tools for data visualization
import matplotlib.pyplot as plt

# Custom-Built Clients for fetching data quality measurements from IRIS web services
from ws_client import MustangClient, AvailabilityClient
# Define absolute paths
ROOT = Path.cwd()
print(f'ROOT dir is {ROOT}')
WFQC = ROOT/'wave_qc_files'
os.makedirs(str(WFQC), exist_ok=True)
ROOT dir is /Users/nates/Code/GitHub/2025_ML_TSC/notebooks/Nate

Composing a MUSTANG query#

The MustangClient class’ measurements_request method follows the key=value syntax of the MUSTANG measurements service interface
(https://service.iris.edu/mustang/measurements/1/),
where multiple values can be provided as a comma-delimited string.

This version of MustangClient can also parse lists of metric names (see below).

The full list of MUSTANG metrics and detailed descriptions of their meaning can be found at the link above.
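For reference, here is a minimal sketch of the kind of HTTP call that MustangClient wraps, using only `requests`. This is illustrative, not the actual client internals; the endpoint and key=value parameters follow the measurements service interface linked above.

# A minimal, illustrative request to the MUSTANG measurements service
# (an assumption about what MustangClient does under the hood)
import requests

params = {
    'metric': 'percent_availability',  # comma-delimited string for multiple metrics
    'net': 'UW',
    'sta': 'MBW',
    'cha': 'EHZ',
    'format': 'text',                  # ask the service for plain-text output
}
resp = requests.get('https://service.iris.edu/mustang/measurements/1/query',
                    params=params, timeout=30)
resp.raise_for_status()
print(resp.text[:300])                 # peek at the first few lines of the reply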

The metrics we’ll use in this exercise are:

  • sample_min: the minimum sample value observed in a 24 hour period

  • max_range: the maximum range between any two samples in a 5 minute window within a 24 hour period

  • percent_availability: the percent of a 24 hour period for which there are data

  • sample_unique: number of unique sample values reported in a 24 hour window

  • num_gaps: number of data gaps encountered within a 24 hour window

The seismic station we’re looking at is UW.MBW.01.EHZ, one of the longest-running stations in the PNSN. It
monitored Mount Baker volcano until late 2023, when it was replaced with UW.MBW2.

UW.MBW was having some issues towards the end of its life. Can you find spots where it looks like the data might not be as useful?#

# Initialize the client
mclient = MustangClient()
# Compose a query for MUSTANG metrics for an analog seismometer near Mount Baker (Washington, USA)
metric = ['sample_min', 'max_range', 'percent_availability', 'sample_unique', 'num_gaps']
query = {'metric': metric,
         'net': 'UW',
         'sta': 'MBW',
         'loc': '*',
         'cha': 'EHZ'}
# Run the query (sometimes this requires a few starts/stops - something to do with `requests` internals)
df_m = mclient.measurements_request(**query)
# Write to disk
df_m.to_csv(WFQC/'UW.MBW_MUSTANG_metrics.csv', header=True, index=True)
# Just in case we need to load from disk because `requests` is struggling:
# if the request above failed, `df_m` is undefined and we fall back to the cached CSV
try:
    display(df_m)
except NameError:
    df_m = pd.read_csv(WFQC/'UW.MBW_MUSTANG_metrics.csv', index_col=['start','target'], parse_dates=['start'])
    display(df_m)
                              max_range  num_gaps  sample_unique  sample_min  percent_availability
start       target
2025-02-11  UW.MBW.01.EHZ.M      3399.0       3.0           22.0         0.0                94.203
2025-02-10  UW.MBW.01.EHZ.M      3380.0       3.0           27.0         0.0                99.985
2025-02-09  UW.MBW.01.EHZ.M      3379.0       3.0           14.0         0.0                99.739
2025-02-08  UW.MBW.01.EHZ.M      3381.0       3.0           17.0         0.0                99.993
2025-02-07  UW.MBW.01.EHZ.M      3394.0       6.0           23.0         0.0                99.990
...                                 ...       ...            ...         ...                   ...
1987-03-20  UW.MBW..EHZ.M           NaN       NaN            NaN         NaN                 0.000
1987-01-16  UW.MBW..EHZ.M           NaN       NaN            NaN         NaN                 0.000
1987-01-14  UW.MBW..EHZ.M           NaN       NaN            NaN         NaN                 0.000
1987-01-26  UW.MBW..EHZ.M           NaN       NaN            NaN         NaN                 0.000
1987-01-27  UW.MBW..EHZ.M           NaN       NaN            NaN         NaN                 0.000

[17093 rows x 5 columns]
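
Before plotting, we can screen the table directly for suspect station-days. A quick sketch (the thresholds below are illustrative assumptions, not PNSN standards): very few unique sample values suggests a flatlined channel, while low availability flags telemetry dropouts.

# Flag station-days that look suspect (thresholds are illustrative assumptions)
suspect = df_m[(df_m['sample_unique'] < 10) | (df_m['percent_availability'] < 50.0)]
print(f'{len(suspect)} of {len(df_m)} station-days flagged as suspect')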

# Plot each metric in its own stacked panel
fig = plt.figure(figsize=(8, 12))
gs = fig.add_gridspec(nrows=len(df_m.columns), hspace=0)

for _e, _c in enumerate(df_m.columns):
    ax = fig.add_subplot(gs[_e])
    # x-axis: the `start` dates from the MultiIndex; y-axis: daily metric values
    ax.plot(df_m.index.get_level_values(0), df_m[_c].values, '.', label=_c)
    ax.set_ylabel(_c)
    ax.grid()
[Figure: daily values of each MUSTANG metric for UW.MBW EHZ plotted against time, one panel per metric]

Lots of gaps#

Trying to bulk download gappy data from web services can cause the entire request to fail.

If we shape our requests around information on data availability (and on which data appear meaningful), this job becomes much easier.

Thankfully, data availability is already documented by the NSF SAGE Facility FDSN Web Service!

Running an FDSN Web Service query#

Use the custom-built AvailabilityClient class, which follows the syntax of the
corresponding web service: https://service.iris.edu/fdsnws/availability/1/

For this example we’ll keep looking at station UW.MBW.
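
As with MUSTANG, here is a minimal sketch of the underlying HTTP call that AvailabilityClient wraps. This is illustrative only; parameter names follow the fdsnws-availability interface linked above, and the real client parses the reply into a DataFrame.

# A minimal, illustrative request to the fdsnws-availability service
import requests

resp = requests.get('https://service.iris.edu/fdsnws/availability/1/query',
                    params={'net': 'UW', 'sta': 'MBW', 'cha': 'EHZ',
                            'format': 'text'},
                    timeout=60)
resp.raise_for_status()
print('\n'.join(resp.text.splitlines()[:5]))  # header line + first few segments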

# Initialize the client
aclient = AvailabilityClient()
# Run a data availability request for everything UW.MBW.*.EHZ has to offer
df_a = aclient.availability_request(sta='MBW', net='UW', cha='EHZ')
# Write to disk
df_a.to_csv(WFQC/'UW.MBW_Availability.csv', header=True, index=True)
# Take a look at the query results (with an option to read from disk for exercise expediency)
try:
    display(df_a)
except NameError:
    df_a = pd.read_csv(WFQC/'UW.MBW_Availability.csv', index_col=[0], parse_dates=['Earliest','Latest'])
    display(df_a)
      Network Station Location Channel Quality  SampleRate                          Earliest                            Latest
0          UW     MBW              EHZ       M     104.082  1980-01-11 21:40:16.543800+00:00  1980-01-11 21:41:08.185791+00:00
1          UW     MBW              EHZ       M     104.084  1980-01-15 05:34:14.273800+00:00  1980-01-15 05:34:53.617023+00:00
2          UW     MBW              EHZ       M     104.079  1980-01-17 05:39:19.142000+00:00  1980-01-17 05:40:03.406475+00:00
3          UW     MBW              EHZ       M     104.082  1980-01-25 22:40:03.229300+00:00  1980-01-25 22:40:54.871291+00:00
4          UW     MBW              EHZ       M     104.069  1980-01-27 23:41:48.775700+00:00  1980-01-27 23:42:15.825070+00:00
...       ...     ...      ...      ...     ...         ...                               ...                               ...
91602      UW     MBW       01      EHZ       M     100.000         2025-02-10 08:00:03+00:00  2025-02-10 08:00:03.990000+00:00
91603      UW     MBW       01      EHZ       M     100.000  2025-02-10 08:00:04.200000+00:00  2025-02-10 08:00:33.190000+00:00
91604      UW     MBW       01      EHZ       M     100.000  2025-02-10 08:00:45.400000+00:00  2025-02-11 08:00:01.390000+00:00
91605      UW     MBW       01      EHZ       M     100.000  2025-02-11 08:00:06.100000+00:00  2025-02-11 08:01:02.090000+00:00
91606      UW     MBW       01      EHZ       M     100.000         2025-02-11 08:01:03+00:00  2025-02-11 22:36:36.990000+00:00

[91607 rows x 8 columns]
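
With roughly 91k segments over 45 years, summarizing segment durations gives a feel for how fragmented this archive is. A quick sketch, assuming Earliest/Latest are tz-aware timestamps (as displayed above):

# Summarize contiguous-segment durations in seconds
seg_sec = (df_a.Latest - df_a.Earliest).dt.total_seconds()
print(seg_sec.describe())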

What is going on with the sampling rates?#

# Plot the reported sampling rate against each segment's start time
_series = pd.Series(df_a.SampleRate.values, index=df_a.Earliest.values)
ax = _series.plot()
ax.set_ylabel('Sampling Rate [sps]')
Text(0, 0.5, 'Sampling Rate [sps]')
[Figure: reported sampling rate vs. segment start time - values near 104 sps for the early analog-era records, exactly 100 sps for recent ones]
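
To quantify what the plot shows, we can tally the distinct rates; a quick sketch, rounding to whole samples per second to lump the drifting analog-era rates together:

# Count segments at each (rounded) sampling rate
print(df_a.SampleRate.round(0).value_counts().sort_index())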

Now let’s finally look at some data!#

# Get an obspy client for fetching waveforms
wclient = Client('IRIS')
# Subset available segments to a time window (currently uses pandas Timestamp objects)
_df_a = df_a[(df_a.Earliest >= pd.Timestamp('2023-02-19',tz='UTC')) & (df_a.Latest <= pd.Timestamp('2023-02-21',tz='UTC'))]
display(_df_a)
      Network Station Location Channel Quality  SampleRate                          Earliest                            Latest
86172      UW     MBW       01      EHZ       M       100.0  2023-02-19 02:18:28.300000+00:00  2023-02-19 02:19:31.290000+00:00
86173      UW     MBW       01      EHZ       M       100.0  2023-02-19 02:19:32.400000+00:00  2023-02-19 08:00:02.390000+00:00
86174      UW     MBW       01      EHZ       M       100.0  2023-02-19 08:00:06.300000+00:00  2023-02-19 12:01:02.290000+00:00
86175      UW     MBW       01      EHZ       M       100.0  2023-02-19 12:01:03.200000+00:00  2023-02-20 01:02:49.190000+00:00
86176      UW     MBW       01      EHZ       M       100.0  2023-02-20 01:02:50.100000+00:00  2023-02-20 02:45:06.090000+00:00
86177      UW     MBW       01      EHZ       M       100.0  2023-02-20 02:45:06.200000+00:00  2023-02-20 02:46:26.190000+00:00
86178      UW     MBW       01      EHZ       M       100.0  2023-02-20 02:46:27.200000+00:00  2023-02-20 08:00:01.190000+00:00
86179      UW     MBW       01      EHZ       M       100.0  2023-02-20 08:00:05.800000+00:00  2023-02-20 08:00:43.790000+00:00
86180      UW     MBW       01      EHZ       M       100.0  2023-02-20 08:00:43.900000+00:00  2023-02-20 11:02:01.890000+00:00
86181      UW     MBW       01      EHZ       M       100.0  2023-02-20 11:05:19.900000+00:00  2023-02-20 13:40:36.890000+00:00
86182      UW     MBW       01      EHZ       M       100.0         2023-02-20 13:40:37+00:00  2023-02-20 13:42:52.990000+00:00
86183      UW     MBW       01      EHZ       M       100.0  2023-02-20 13:42:53.100000+00:00  2023-02-20 16:10:36.090000+00:00
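
These twelve segments are nearly back-to-back. If the number of bulk entries ever becomes unwieldy, segments separated by less than a small tolerance can be merged first. A sketch (the 5-second tolerance and the `windows` list are illustrative assumptions; the availability service also exposes server-side merge options):

# Merge segments whose inter-segment gap is below `tol` seconds (illustrative)
tol = 5.0
windows = []
for _, row in _df_a.iterrows():
    if windows and (row.Earliest - windows[-1][1]).total_seconds() <= tol:
        windows[-1][1] = row.Latest  # extend the previous window
    else:
        windows.append([row.Earliest, row.Latest])
print(f'{len(_df_a)} segments -> {len(windows)} request windows with tol={tol}s')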
# Compose a bulk request
bulk = []
for _, row in _df_a.iterrows():
    # Switch pandas Timestamp objects back to UTCDateTime objects for requests
    req = (row.Network, row.Station, row.Location, row.Channel, UTCDateTime(row.Earliest.timestamp()), UTCDateTime(row.Latest.timestamp()))
    bulk.append(req)
display(bulk)
[('UW', 'MBW', '01', 'EHZ', 2023-02-19T02:18:28.300000Z, 2023-02-19T02:19:31.290000Z),
 ('UW', 'MBW', '01', 'EHZ', 2023-02-19T02:19:32.400000Z, 2023-02-19T08:00:02.390000Z),
 ('UW', 'MBW', '01', 'EHZ', 2023-02-19T08:00:06.300000Z, 2023-02-19T12:01:02.290000Z),
 ('UW', 'MBW', '01', 'EHZ', 2023-02-19T12:01:03.200000Z, 2023-02-20T01:02:49.190000Z),
 ('UW', 'MBW', '01', 'EHZ', 2023-02-20T01:02:50.100000Z, 2023-02-20T02:45:06.090000Z),
 ('UW', 'MBW', '01', 'EHZ', 2023-02-20T02:45:06.200000Z, 2023-02-20T02:46:26.190000Z),
 ('UW', 'MBW', '01', 'EHZ', 2023-02-20T02:46:27.200000Z, 2023-02-20T08:00:01.190000Z),
 ('UW', 'MBW', '01', 'EHZ', 2023-02-20T08:00:05.800000Z, 2023-02-20T08:00:43.790000Z),
 ('UW', 'MBW', '01', 'EHZ', 2023-02-20T08:00:43.900000Z, 2023-02-20T11:02:01.890000Z),
 ('UW', 'MBW', '01', 'EHZ', 2023-02-20T11:05:19.900000Z, 2023-02-20T13:40:36.890000Z),
 ('UW', 'MBW', '01', 'EHZ', 2023-02-20T13:40:37.000000Z, 2023-02-20T13:42:52.990000Z),
 ('UW', 'MBW', '01', 'EHZ', 2023-02-20T13:42:53.100000Z, 2023-02-20T16:10:36.090000Z)]
# Run bulk request
st = wclient.get_waveforms_bulk(bulk)
st.plot()
[Figure: waveforms returned by the bulk request for UW.MBW.01.EHZ, 2023-02-19 through 2023-02-20]

It looks like we have continuous data…#

Are they continuous?#
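
One way to check, using ObsPy’s built-in gap reporting on the stream we just downloaded:

# Report gaps/overlaps between adjacent traces in the Stream
gaps = st.get_gaps()
print(f'{len(gaps)} gaps/overlaps detected')
st.print_gaps()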

Where do they stop being useful?#

What does this mean for your workflow?#