Submit on Canvas by Thursday 11/10/22 at 11:49 PM. Ask questions and discuss on Piazza. Have fun!
First data set description
The data (in the file SpikeData05.npz) are available here and can be loaded as shown below. Please note that a .mat file of the same data is available here for those who prefer MATLAB. Unfortunately, I, personally, will not be able to provide any MATLAB coding support. A solutions notebook exploring the raw data and generating the figures for problems 1, 2, and 3 is available here.
import numpy as np

# Load the provided archive and pull out the spike snippets
data = np.load('SpikeData05.npz')
SpikeWaveforms = data['SpikeWaveforms']  # M x 40 x 4 array of snippets
Data description
RawData: This has been handled for you, but if you'd like a description of it, click here.
SpikeWaveforms: This is a matrix of extracellular action potential waveforms sampled simultaneously on all four channels of the tetrode. The size is
$M\times40\times4$, where $M=33787$ is the number of extracellular action potentials detected (I hope you're happy Brennan),
and the recorded portion of the extracellular potential waveform ("snippet") is a 40-sample window around the threshold crossing.
This is a matrix of raw data recorded from a tetrode (a four-channel electrode) in the hippocampus
of a Long Evans rat --- he was a good boy. The sampling rate is 30 kHz, and the data are simultaneous
voltage signals (in $\mu$V) from each channel. I (Dr. Kemere) considered having you do the spike extraction
for these data --- this involves the steps we (Shay & students in 483/548) discussed in class: (1) filtering
out the LFP to isolate the spike band (a 2-pole 600 Hz high-pass and a 2-pole 6 kHz low-pass) and (2) finding instances where
one of the channels crosses a threshold. For the sake of time, I, Dr. Caleb Tilo Kemere, decided to do this
for you. Additionally, for the sake of understanding, I, Shayok "Shay" Dutta, BS, MS, decided to explore the
data for you and provided the results in the shared notebook linked above. Each snippet is 40
samples long, and the threshold crossing happens between the 10th and 11th samples in this data. I, Dr. Kemere,
chose a value of 60 $\mu$V for the threshold, which Shay reverse-engineers in the shared notebook because he chose
to jump into exploring the data without reading the instructions (cuz like let's be real, who actually would
read instructions as opposed to hopping in?).
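If you're curious what that extraction step might look like in code, here is a minimal sketch (not the exact pipeline we used; the filter design calls and the variable name RawData, a samples-by-4 array of raw voltages in $\mu$V, are illustrative assumptions):

import numpy as np
from scipy import signal

# RawData: (samples x 4) array of raw voltages in uV (an assumed name; see description above)
fs = 30000  # sampling rate in Hz
# Spike-band filtering: 2-pole 600 Hz high-pass followed by a 2-pole 6 kHz low-pass
b_hp, a_hp = signal.butter(2, 600, btype='highpass', fs=fs)
b_lp, a_lp = signal.butter(2, 6000, btype='lowpass', fs=fs)
spike_band = signal.filtfilt(b_lp, a_lp, signal.filtfilt(b_hp, a_hp, RawData, axis=0), axis=0)

# Mark rising-edge crossings of the 60 uV threshold on any channel
above = (spike_band > 60).any(axis=1)
crossings = np.where(~above[:-1] & above[1:])[0] + 1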
I hope you're happy that I corrected this :P! I'm typing this up in the joint Kemere-Ji lab journal club while sitting right next to you.
You're also working on other stuff, so I don't feel as bad for not paying attention, but let's be real... it's already
30 minutes overtime and there's one person who's sleeping right now. She just woke up lol. Back to dozing off with her hand on her chin,
acting like she's paying attention lol. I'd be asleep if I wasn't working on other things as well, so it is what it is. Are you biting your
lappy? Ah good, now you're giving it a hug. Your MacBook is happy again.
Problem 1
Plot the spike peaks in 4-D space (10 points)
Find the peak amplitude (post-threshold) of each snippet on each channel. If you look at some waveforms, you'll see that the peak typically happens
within 5-10ish data points of the threshold crossing (later peaks are noise!). Search for peaks that are between sample 5 and sample 25. If you are
new to NumPy, the function np.argmax() is a good place to start!
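For instance, one possible per-channel search looks like this (a sketch, assuming SpikeWaveforms is the $M\times40\times4$ array loaded above; the variable names are illustrative):

# Restrict the search to a window after the threshold crossing and take the per-channel maximum
window = SpikeWaveforms[:, 5:25, :]           # M x 20 x 4 (samples 5 through 24; adjust endpoints as you see fit)
PeakIndices = np.argmax(window, axis=1) + 5   # peak sample index within each snippet, per channel
Peaks = window.max(axis=1)                    # M x 4 array of peak amplitudes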
Make a 6-subplot figure showing a point in each panel for each detected action potential. The point should be the peak amplitude on channel A vs. the
peak amplitude on channel B, where A and B range over the pairwise combinations of {1, 2, 3, 4} (i.e., 1 vs 2, 1 vs 3, 1 vs 4, etc.). Once you've found the
snippet peaks as a list or NumPy array, the code below can be useful. Please note that there is also code provided within the notebook linked earlier
that plots this differently (and more aesthetically, in Shay's opinion). Remember to make sure plots are labeled and understandable.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise hexbin plots of the peak amplitudes (lower triangle only)
PP = pd.DataFrame(np.array(Peaks))
g = sns.PairGrid(PP)
g = g.map_lower(plt.hexbin, gridsize=50, mincnt=1, cmap='seismic', bins='log')
# Hide the diagonal and upper-triangle panels
for i, j in zip(*np.triu_indices_from(g.axes, 0)):
    g.axes[i, j].set_visible(False)
Please explore your data and briefly comment on whether all of the peak values you found make logical sense.
HINT/NOTE!
You can choose whether to find the peak index within each snippet on all 4 channels independently, or jointly (e.g., averaging the 4 channels,
finding the peak of the averaged waveform, and then extracting the 4-D value at the resulting index), but note that the index of the peak
should be the same across channels time-wise.
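For example, the joint option might look something like this (a sketch under the same assumptions as the snippet above; variable names are illustrative):

# Average the 4 channels, find one peak index per snippet, then read out
# the 4-channel values at that shared index
window = SpikeWaveforms[:, 5:25, :]
mean_wf = window.mean(axis=2)                               # M x 20
joint_idx = np.argmax(mean_wf, axis=1)                      # one peak index per snippet
Peaks = window[np.arange(window.shape[0]), joint_idx, :]    # M x 4 array of peak amplitudes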
Problem 2
Clustering with K-Means (40 points)
Note that SciPy has an implementation of K-Means. For this assignment, however, you are to implement your own version. You are more than
welcome to check your answers against the available implementation (see the sketch below). Use your implementation to determine the neuron responsible for each recorded
spike. Use only the data from the first channel of the tetrode. You may also use basic NumPy commands, i.e., linear algebra and functions that
you would find on a graphing calculator. Treat each snippet as a point $\textbf{x}_m\in\mathbb{R}^D$, $m\in\{1,2,\ldots,M\}$, where $D=40$ is the length of each snippet
and $M$ is the total number of snippets. In this problem, we will assume there are $K=3$ neurons contributing spikes to the recorded waveform. For computational
tractability, use only the first 2000 snippets ($M=2000$).
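If you want to sanity-check your own implementation against SciPy, one way might be (a sketch, assuming SpikeWaveforms as loaded above; treat the exact kmeans2 arguments as illustrative):

from scipy.cluster.vq import kmeans2

# First channel, first 2000 snippets: a 2000 x 40 matrix of points in R^40
X = SpikeWaveforms[:2000, :, 0]
centroids, labels = kmeans2(X, 3, minit='++')   # centroids: 3 x 40, labels: length 2000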
For each cluster ($k=1,2,3$), create a separate "voltage vs time" plot containing the following:
the cluster center $\boldsymbol{\mu}_k$ generated by K-Means as a red waveform trace (i.e., the prototypical extracellular potential for the $k$-th neuron)
all of the waveform snippets assigned to the $k$-th cluster.
Plot the objective function $J$ versus the iteration number. How many iterations did it take for K-Means to converge?
Repeat the problem with $K=4$ clusters. Make some observations and comments comparing the convergence of the two different cluster choices. Do you think that there
should be more clusters here? Are the clusters clearly separated?
Problem 3
Using Gaussian Mixtures (50 points)
Please note that a reference problem for Gaussian Mixture Models can be found on the old course website.
In Python, the scikit-learn package has nice support for mixture models. Install this package with either conda or pip, as appropriate. The code below will get you started using it.
from sklearn import mixture
gmix = mixture.GaussianMixture(n_components=2, covariance_type='full')
gmix.fit(peaks)
You will use this package to define and train a model of the 4-channel waveform peaks. To make this
problem more interesting, load the data from the file SpikeData12.npz
(SpikeData12.mat), and find the waveform peaks as in problem 1.
Use the first 5000 snippet peaks (i.e., $\mathbf{x}_m\in\mathbb{R}^4$ are the peak values of each spike snippet on the four tetrode channels,
for $m=1,\ldots,5000$ snippets), and learn the Gaussian mixture parameters for $K=10$ neurons. Plot the resulting cluster assignments in a six-panel
plot as in problem 1, with the clusters color-coded.
A well documented example can be found here in the scikit-learn docs.
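As a starting point, the fit-and-plot step might look something like this (a sketch, assuming peaks holds the $M\times4$ peak array computed as in problem 1, here for SpikeData12; the plotting layout is just one option):

from itertools import combinations
import matplotlib.pyplot as plt
from sklearn import mixture

# Fit a 10-component mixture to the first 5000 peaks and color-code by predicted cluster
X = peaks[:5000]
gmix = mixture.GaussianMixture(n_components=10, covariance_type='full')
labels = gmix.fit_predict(X)

fig, axes = plt.subplots(2, 3, figsize=(12, 8))
for ax, (a, b) in zip(axes.ravel(), combinations(range(4), 2)):
    ax.scatter(X[:, a], X[:, b], c=labels, s=2, cmap='tab10')
    ax.set_xlabel(f'Channel {a + 1} peak (uV)')
    ax.set_ylabel(f'Channel {b + 1} peak (uV)')
plt.tight_layout()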
Find the cluster assignments for the next 5000 snippet-peaks. You will use these as test data to evaluate how many clusters there should be in the data.
Calculate the model likelihood of the second set of 5000 snippet-peaks using the parameters you found in part a. The score_samples()
function returns the log-likelihood of one or more data points. The score() function returns the average log-likelihood over many data points,
which is the measure of model quality we care about. Now, repeat the EM learning in part a, but with $K=8,9,\ldots,20$. What is the likelihood of the test data
for each model? Which model would you use if you wanted only well-clustered neurons for your analysis? For the most likely value of $K$, plot the cluster
assignments as in part a.
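One way to structure that sweep (a sketch under the same assumptions as above; score() is scikit-learn's average per-sample log-likelihood):

from sklearn import mixture

# Train on the first 5000 peaks, evaluate on the next 5000
train, test = peaks[:5000], peaks[5000:10000]
test_ll = {}
for K in range(8, 21):
    gm = mixture.GaussianMixture(n_components=K, covariance_type='full')
    gm.fit(train)
    test_ll[K] = gm.score(test)   # average log-likelihood of the held-out peaks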
Train models with a full covariance matrix as in part a and with a diagonal covariance matrix (use the covariance_type='diag' option). Compare
the model likelihoods on the test data. Which model does a better job?
Problem 4
Clustering using Pearson Correlation (50 points)
This problem replicates a result from the paper "Sub-second dynamics of theta-gamma coupling in hippocampal CA1"
by Zhang et al. eLife 2019. They calculated
the spectrograms of the LFP data and extracted individual theta cycles. They then clustered these spectrograms to
assess whether higher frequency gamma oscillations preferentially occur at certain phases of theta.
Data Description
The data for this problem can be found
here and loaded as:
import numpy as np
import matplotlib.pyplot as plt
data = np.load('hw4problem5.npy')
plt.imshow(data[0,:,:])
The data you are given are a series of matrices of size $K \times N_{\theta}$, where $K$ is 81 (the number of frequency bins) and $N_{\theta}$, the number of
phase bins, is 20. The phase goes from $-\pi$ to $\pi$ (i.e., phase = np.linspace(-np.pi,np.pi,20)), and the
frequency vector can be calculated as freqs = np.arange(20,182,2)[::-1]. Note that in the frequency vector
and the data, the 0-th entry corresponds to the highest frequency (180 Hz). This is so that plotting with plt.imshow()
works nicely.
Recall from lecture how I made fun of my former mentee for messing up the distance
formula. In this paper, they use a distance measure derived from the Pearson correlation, specifically $d(x,y)=1-\rho(x,y)$.
As shown in this paper, clustering with Pearson
correlation distance is equivalent to using K-Means on normalized data, $\tilde{x}=\frac{x-\bar{x}}{\sigma_x}$, where $\bar{x}$ is the mean of
the data point $x$ (across its dimensions) and $\sigma_x$ is its standard deviation.
To cluster, you can use your K-Means implementation or the scikit-learn one. Assuming you've created a matrix normdata that is
normalized properly, example code along the lines of the sketch below will work.
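A minimal sketch (assuming data has the shape described above; the variable names and the choice of scikit-learn's KMeans are illustrative):

from sklearn.cluster import KMeans

# Flatten each spectrogram, z-score it across its own dimensions, then cluster
X = data.reshape(len(data), -1)
normdata = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
km = KMeans(n_clusters=4, n_init=10).fit(normdata)
labels = km.labels_   # cluster assignment for each theta cycle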
Cluster the theta cycle spectrograms into $K=4$ clusters. Plot the mean spectrogram of each cluster as in their Figure 1D.
Find 3 data points (i.e., theta cycles) which are close to the centroid of each cluster and plot them. How similar do individual
cycles look to the averages?
Find 3 data points which are close to the boundaries of two or more of the different clusters. (You can do this by finding the
distances to each cluster center and sorting the data points by how close they are.) Plot these examples. Give a qualitative
description.