Note: this example is also available as a Jupyter Notebook (ipynb).
Using dask¶
dask is a Python package built upon the scientific stack to enable scaling of Python from interactive sessions to multi-core and multi-node processing.
Of particular relevance to SEGY-SAK is that xarray.Dataset loads naturally into dask.
Imports and Setup¶
Here we import the plotting tools and numpy, and set up the dask.Client, which will auto-start a LocalCluster. Printing the client returns details about the dashboard link and resources.
[2]:
import numpy as np
from segysak import open_seisnc, segy
import matplotlib.pyplot as plt
%matplotlib inline
[3]:
from dask.distributed import Client
client = Client()
client
[3]:
(Client and LocalCluster summary widget)
We can also scale the cluster to be a bit smaller.
[4]:
client.cluster.scale(2, memory='0.5gb')
client
[4]:
(Client and LocalCluster summary widget, now scaled down)
Lazy loading from SEISNC using chunking¶
If your data is in SEG-Y format it must be converted to SEISNC before it can be used with dask. If you do this with the CLI it need only happen once.
[5]:
segy_file = 'data/volve10r12-full-twt-sub3d.sgy'
seisnc_file = 'data/volve10r12-full-twt-sub3d.seisnc'
segy.segy_converter(segy_file, seisnc_file, iline=189, xline=193, cdpx=181, cdpy=185, silent=True)
Fast direction is CROSSLINE_3D
By specifying the chunks argument to the open_seisnc command we can ask dask to fetch the data in chunks of size n. In this example the iline dimension will be chunked in groups of 100. The valid arguments to chunks depend on the dataset, but any dimension can be used.
Even though the size of the dataset is about 40 MB, it hasn't yet been loaded into memory, nor will dask load it entirely unless the operation demands it.
[6]:
seisnc = open_seisnc('data/volve10r12-full-twt-sub3d.seisnc', chunks={'iline':100})
seisnc.seis.humanbytes
[6]:
'40.06 MB'
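To make the chunked, lazy behaviour concrete, here is a sketch using a plain dask array of the same shape as this cube (61 ilines, 202 xlines, 850 samples). The array and its chunk sizes are synthetic stand-ins for illustration, not part of SEGY-SAK:

```python
import dask.array as da

# A synthetic stand-in with the same shape as the Volve sub-cube,
# chunked along the first (iline) axis as chunks={'iline': 100} would be
arr = da.zeros((61, 202, 850), dtype="float32", chunks=(100, 202, 850))

print(arr.chunks)  # ((61,), (202,), (850,)) - only 61 ilines, so a single chunk
print(arr.nbytes)  # 41894800 bytes, roughly the 40 MB reported by seis.humanbytes
```

No element of `arr` exists in memory yet; dask only records the shape, dtype and chunking.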
Let's see what our dataset looks like. Note that the variables are dask.array. This means they are references to the data on disk. The dimensions must be loaded so dask knows how to manage your dataset.
[7]:
seisnc
[7]:
<xarray.Dataset>
Dimensions:  (iline: 61, twt: 850, xline: 202)
Coordinates:
    cdp_x    (iline, xline) float32 dask.array<chunksize=(61, 202), meta=np.ndarray>
    cdp_y    (iline, xline) float32 dask.array<chunksize=(61, 202), meta=np.ndarray>
  * iline    (iline) int64 10090 10091 10092 10093 ... 10147 10148 10149 10150
  * twt      (twt) float64 4.0 8.0 12.0 16.0 ... 3.392e+03 3.396e+03 3.4e+03
  * xline    (xline) int64 2150 2151 2152 2153 2154 ... 2347 2348 2349 2350 2351
Data variables:
    data     (iline, xline, twt) float32 dask.array<chunksize=(61, 202, 850), meta=np.ndarray>
Attributes:
    coord_scalar:        -100.0
    measurement_system:  m
    percentiles:         [-6.97198262e+00 -6.52054033e+00 -1.49142619e+00 -5....
    sample_rate:         4.0
    source_file:         volve10r12-full-twt-sub3d.sgy
    text:                C 1 SEGY OUTPUT FROM Petrel 2017.2 Saturday, June 06...
    ns:                  None
    d3_domain:           None
    epsg:                None
    corner_points:       None
    corner_points_xy:    None
    srd:                 None
    datatype:            None
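The split between eagerly loaded dimensions and lazy data variables can be sketched with a tiny synthetic Dataset (the names and sizes here are made up for illustration; only the pattern matters):

```python
import numpy as np
import xarray as xr

# Hypothetical miniature cube: after .chunk(), data variables become
# lazy dask arrays while dimension coordinates stay as eager numpy arrays
ds = xr.Dataset(
    {"data": (("iline", "xline"), np.zeros((4, 6), dtype="float32"))},
    coords={"iline": np.arange(4), "xline": np.arange(6)},
).chunk({"iline": 2})

print(type(ds["data"].data))   # dask.array.core.Array - a lazy reference
print(type(ds["iline"].data))  # numpy.ndarray - dimensions are loaded
```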
Operations on SEISNC using dask¶
In this simple example we calculate the mean of the entire cube. If you check the dashboard (when running this example yourself) you can see the task graph and task stream execution.
[8]:
mean = seisnc.data.mean()
mean
[8]:
<xarray.DataArray 'data' ()> dask.array<mean_agg-aggregate, shape=(), dtype=float32, chunksize=(), chunktype=numpy.ndarray>
Whoa-oh, the mean is what? Yeah, dask won't calculate anything until you ask it to. This means you can string computations together into a task graph for lazy evaluation. To get the mean, try this:
[9]:
mean.compute().values
[9]:
array(-7.317369e-05, dtype=float32)
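The same deferred build-then-compute pattern applies to any chain of operations. A minimal sketch on a plain dask array (unrelated to the seismic data) makes the two steps explicit:

```python
import dask.array as da

# Build a task graph: squaring and summing are recorded, not executed
x = da.arange(8, chunks=4)
result = (x ** 2).sum()

print(result)            # a lazy dask scalar - no value has been computed yet
print(result.compute())  # 140 - the graph runs only now
```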
Plotting with dask¶
The lazy loading of data means we can plot what we want using xarray-style slicing, and dask will fetch only the data we need.
[10]:
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(20, 10))
iline = seisnc.sel(iline = 10100).transpose('twt', 'xline').data
xline = seisnc.sel(xline = 2349).transpose('twt', 'iline').data
zslice = seisnc.sel(twt = 2900, method='nearest').transpose('iline', 'xline').data
q = iline.quantile([0, 0.001, 0.5, 0.999, 1]).values
rq = np.max(np.abs([q[1], q[-2]]))
iline.plot(robust=True, ax=axs[0, 0], yincrease=False)
xline.plot(robust=True, ax=axs[0, 1], yincrease=False)
zslice.plot(robust=True, ax=axs[0, 2])
imshow_kwargs = dict(
cmap='seismic', aspect='auto', vmin=-rq, vmax=rq, interpolation='bicubic'
)
axs[1, 0].imshow(iline.values, **imshow_kwargs)
axs[1, 0].set_title('iline')
axs[1, 1].imshow(xline.values, **imshow_kwargs)
axs[1, 1].set_title('xline')
axs[1, 2].imshow(zslice.values, origin='lower', **imshow_kwargs)
axs[1, 2].set_title('twt')
[10]:
Text(0.5, 1.0, 'twt')
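That dask fetches only the data we need can be checked on a synthetic array of the same shape as this cube (the chunk sizes below are illustrative assumptions):

```python
import dask.array as da

# Chunked along the first axis; selecting one "inline" touches one chunk
vol = da.random.random((61, 202, 850), chunks=(10, 202, 850))
line = vol[25]

print(line.shape)         # (202, 850)
print(line.npartitions)   # 1 - only a single chunk is involved
line_np = line.compute()  # materializes just that slice, not the whole cube
```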