Homework 4 (150 points)

Submit on Canvas by Thursday 11/10/22 at 11:49 PM. Ask questions and discuss on Piazza. Have fun!

First data set description

The data (in the file SpikeData05.npz) is available here and can be loaded as shown below. Please note that a .mat file of the same data for those who prefer MATLAB is available here. Unfortunately, I, personally, will not be able to provide any MATLAB coding support. A solutions notebook exploring the raw data and generating the figures for problems 1,2, and 3 is made available here.

import numpy as np
data = np.load('SpikeData05.npz')
SpikeWaveforms = data['SpikeWaveforms']

Data description

x
This is a matrix of raw data recorded from a tetrode (a four channel electrode) in the hippocampus of a long evans rat --- he was a good boy. The sampling rate is 30 kHz, and the data are simultaneous voltage signals (in $\mu$V) from each channel. I (Dr. Kemere) considered having you do the spike extraction for this data --- this involves the steps we (Shay & students in 483/548) discussed in class (1) filtering out LFP into spike band (600 Hz 2 pole highpass and 2 pole 6 kHz low pass) and (2) finding instances where one of the channels crosses a threshold. For the sake of time, I, Dr. Caleb Tilo Kemere, decided to do this for you. Additoinally, for the sake of understanding, I, Shayok "Shay" Dutta, BS, MS, decided to explore the data for you and provided it within a shared notebook provided in a link shared before. Each snippet is 40 sampples long, and the threshold crossing happens between the 10th and 11th samples in this data. I, Dr. Kemere, chose a value of 60 $\mu$V for the threshold which Shay reverse discovers in the shared notebook because he chose to jump into exploring the data without reading the instructions (cuz like let's be real who actually would read instructions as opposed to hopping in?).
x
I hope you're happy for me correcting this :P! I'm typing this up in the joint Kemere-Ji lab journal club while sitting right next ot you. You're also working on other stuff so I don't feel as bad for not paying attention but like let's be real...it's already 30 minutes overtime and there's one person who's sleeping right now. She just woke up lol. Back to dozing off with her hand on her chin acting like she's paying attention lol. I'd be asleep if I wasn't working on other things as well so it is what it is. Are you biting your lappy? Ah good now you're giving it a hug. Your macbook is happy again.

Problem 1

Plot the spike peaks in 4-D space (10 points)

Find the peak amplitude (post-threshold) of each snippet on each channel. If you look at some waveforms, youll see that the peak typically happens within 5-10ish data points of the threshold crossing (later peaks are noise!). Search for peaks that are between sample 5 and sample 25. If you are new to NumPy, the function np.argmax() is a a good place to start!

Make a 6 subplot figure showing a point in each panel for each detected action potential. The point should be the peak amplitude on channel A vs the peak amplitude on channel B, where A and B are the combinational pairs within {1; 2; 3; 4} (i.e., 1 vs 2, 1 vs 3, 1 vs 4, etc.). Once you've found the snippet peaks as a list or numpy array, the code below can be useful. Please not that there is also code provided within the notebook link shared earlier that plots this differently (and more aesthetically in Shay's opinion). Remember to make sure plots are labeled and understandable.

import pandas as pd
import seaborn as sns
PP = pd.DataFrame(np.array(Peaks))
g = sns.PairGrid(PP)
g = g.map_lower(plt.hexbin,gridsize=50, mincnt=1, cmap='seismic',bins='log')
for i, j in zip(*np.triu_indices_from(g.axes, 0)):
    g.axes[i, j].set_visible(False) 

Please explore your data and briefly make comments on whether or not all the peaks chosen value wise make logical sense.

HINT/NOTE!

x
You can choose whether to find the index of the snippet of the peak across all 4 channels independently, or jointly (e.g., averaging the 4 channels, and finding the peak of the averaged waveforms, and then extracting the 4-D value at the resulting index) but note that the index of the peaks should be the same timewise.

Problem 2

Clustering with K-Means (40 points)

Note that SciPy has an implementation for K-Means. For this assignment, however, you are to implement your own version of it. You are more than welcome to check your answers with the implemented version that is available. Use your implementation to determine the neuron responsible for each recorded spike. Use only the data from the first channel of the tetrode. You may also use basic NumPy commands, i.e., linear algebra and functions that you would find on a graphing calculator. Treat each snippet as a point in $\textbf{x}_m\in\mathbb{R}^D,m\in\{1,2,...M\}$, where $D$ is 40, the length of each snippet, and M is the total number of snippets. In this problem, we will assume there are $K=3$ neurons contributing spikes to the recorded waveform. For computational use only the first 2000 snippets ($M=2000$).

  1. For each cluster ($k=1,2,3$), create a separate "voltage vs time" plot containing the following:
    1. the cluster center $\textbf{μ}_k$ generated by K-Means as a red waveform trace (i.e., the prototypical extracellular potential for the k-th neuron)
    2. all of the waveform snippets assigned to the k-th cluster.

  2. Plot the objective number $J$ versus iteration number. How many iterations did it take for K-Means to converge?

  3. Repeat problem with K=4 clusters. Make some observations and comments comparing the convergence of the two different cluster choices. Do you think that there should be more clusters here? Are they clear clusters?

Problem 3

Using Gaussian Mixtures (50 points)

Please note that a reference problem for Gaussian Mixture Models can be found on the old course website.

In Python the scikit-learn package has nice support for mixture models. Install this package with either conda or pip appropriately. The code below will get you started using it.

from sklearn import mixture
gmix = mixture.GaussianMixture(n_components=2, covariance_type='full')
gmix.fit(peaks)

You will use this package to define and train a model of the 4-channel waveform peaks. To make this problem more interesting, load the data from the file SpikeData12.npz (SpikeData12.mat), and find the waveform peaks as in problem 1.

  1. Use the first 5000 snippet peaks (i.e., $\mathbf{x}_m\in\mathbb{R}^4$ are the peak values of the spike snippets in each tetrode channel for $n=1,...5000$ snippets), and learn the Gaussian mixture parameters for $K=10$ neurons. Plot the resulting cluster assignments in a six panel plot as in problem 1 with the clusters color-coded. A well documented example can be found here in the scikit-learn docs.
  2. Find the cluster assignments for the next 5000 snippet-peaks. You will use these as test data to evaluate how many clusters there should be in the data. Calculate the model likelihood of the second set of 5000 snippet-peaks using the parameters you found in part a. The score_samples() function returns the loglikelihood of one or more data points. The score() function returns the average loglikelihood for many data points, which represents the model quality we care about. Now, repeat the EM-learning in a, but with $K=8,9,...,20$. What is the likelihood of the test data for each model? Which model would you use if you wanted only well-clustered neurons for your analysis? For the most likely value of $K$, plot the cluster assignments as in a.
  3. Train models with a full covariance matrix as in a and a diagonal covariance matrix (use the covariance_type='diagonal' option). Compare the model likelihoods on the test data. Which model does a better job?

Problem 4

Clustering using Pearson Correlation (50 points)

This problem replicates a result from the paper "Sub-second dynamics of theta-gamma coupling in hippocampal CA1" by Zhang et al. eLife 2019. They calculated the spectrograms of the LFP data and extracted individual theta cycles. They then clustered these spectrograms to assess whether higher frequency gamma oscillations preferentially occur at certain phases of theta.

Data Description

The data for this problem can be found here and loaded as:

import numpy as np
import matplotlib.pyplot as plt

data = np.load('hw4problem5.npy')
plt.imshow(data[0,:,:])

The data you are given are a series of matrices of size $K \times N_{\theta}$ where $K$ is 81 and $N_{\theta}$, the number of phase bins is 20. The phase goes from $-\pi$ to $\pi$ (i.e., phase = np.linspace(-np.pi,np.pi,20)), and the frequency vector can be calculated as freqs = np.arange(20,182,2)[::-1]. Note that in the frequency vector and the data, the 0-th entry corresponds to the highest frequency (180 Hz). This is so that plotting with plt.imshow() works nicely.

Recall from lecture how I made fun of my former mentee for messing up the distance formula. In this paper they use a distance measure derived from the Pearson correlation, specifically $d(x,y)=1-\rho(x,y)$. As shown in this paper, clustering with Pearson correlation distance is equivalent to using K-Means on normalized data, $\tilde{x}=\frac{x-\tilde{x}}{\sigma_x}$, where $\tilde{x}$ is the mean of the data point $x$ (across its dimensions) and $\sigma_x$ is its standard deviation.

To cluster, you can use your K-means implementation or the scikit-learn one. Assuming you've created a matrix of normdata, which is normalized properly, here's some example code that works.

from sklearn.cluster import KMeans
K = 4
ndata = np.reshape(normdata, \
(normdata.shape[0], normdata.shape[1] * normdata.shape[2]))
kmeans = KMeans(\
n_clusters=K, random_state=0, algorithm="full")\
.fit(ndata)

cluster0 = np.reshape(\
kmeans.cluster_centers_[0,:], \
(normdata.shape[1], normdata.shape[2]))

plt.imshow(cluster0)

  1. Cluster the theta cycle spectrograms into $K=4$ clusters. Plot the mean spectrogram of each cluster as in their Figure 1D.
  2. Find 3 data points (i.e., theta cycles) which are close to the centroid of each cluster and plot them. How similar do individual cycles look to the averages?
  3. Find 3 data points which are close to the boundaries of two or more of the different clusters. (You can do this by finding the distances to each cluster center, and sorting the data points by how close they are!). Plot these examples. Give a qualitative description.