Identifying compartments in Hi-C data
Long-range chromatin contact patterns reveal that chromatin is partitioned into distinct "compartments" - contiguous stretches of chromatin on the order of megabases in size whose epigenetic modifications and associated protein factors cause them to be positioned close to similarly-modified compartments in three-dimensional space within the nucleus. In this post, we will take a look at one line of approaches for computationally identifying compartments in Hi-C datasets - one that uses the trans contact profile for each locus in the genome as a feature vector.
We will install some non-standard packages: hic-straw
to enable us to get contact matrices from .hic files and lib5c
to plot contact heatmaps with extra information aligned to the axes.
!pip -q install hic-straw lib5c
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import coo_matrix
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from straw import straw
from lib5c.plotters.extendable.extendable_heatmap import ExtendableHeatmap
First, we will load a balanced trans contact matrix for the interaction between chr1 and chr2 in GM12878 cells from the Rao et al. 2014 dataset at 1 Mb resolution (just to keep this example notebook fast). hic-straw
allows us to query a remote .hic file from S3 without downloading the whole thing to the local disk, which is quite convenient. We will immediately send the data we get from hic-straw
into a standard scipy sparse matrix.
hic_file = 'https://hicfiles.s3.amazonaws.com/hiseq/gm12878/in-situ/combined.hic'
res = 1000000
row, col, data = map(np.array, straw('KR', hic_file, '1', '2', 'BP', res))
row = row // res
col = col // res
matrix = coo_matrix((data, (row, col))).toarray()
We will transpose the matrix so that the rows represent bins on chr2 - in this notebook we will be focusing on assigning bins on chr2 to different clusters that represent compartments.
matrix = matrix.T
There are a few outlier rows we identified that formed single-bin outlier clusters in the clustering steps below - we will set these outlier rows to all zeros for now. If there are a lot of zero rows in the matrix, it's possible that they may form their own cluster - a more robust approach would be to drop the zero and outlier rows to ensure they don't influence the clustering.
matrix[[89, 90, 91, 92], :] = 0
We can visualize the contact heatmap to see the classic "plaid" long range interaction pattern characteristic of compartments.
plt.imshow(matrix, cmap='Reds', vmin=0, vmax=1000)
We can z-score the matrix and perform PCA on the z-scores, keeping only the first principle component. The projection onto the first principle component (which gets stored in loadings
) can be used as a continuous "compartment score".
scaler = StandardScaler()
loadings = PCA(n_components=1).fit_transform(StandardScaler().fit_transform(matrix))
matrix.shape, loadings.shape
We can visualize the compartment score along the y-axis of the heatmap. It correlates with the plaid pattern: one compartment has large positive compartment scores while the other has large negative compartment scores.
h = ExtendableHeatmap(
array=matrix,
grange_x={'chrom': 'chr1', 'start': 0, 'end': matrix.shape[1] * res},
grange_y={'chrom': 'chr2', 'start': 0, 'end': matrix.shape[0] * res},
colormap='Reds',
colorscale=(0, 1000)
)
pc1_ax = h.add_ax(loc='left', name='PC1')
pc1_ax.plot(loadings[:, 0], np.arange(loadings.shape[0]))
pc1_ax.set_xlim([-10, 10])
pc1_ax.set_ylim([matrix.shape[0], 0])
We can re-do the PCA, keeping two principle components this time.
loadings_2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(matrix))
This gives us a two-dimensional embedding of each bin on chr2.
plt.scatter(loadings_2[:, 0], loadings_2[:, 1])
As an alternative to the continuous compartment score based on the projection onto the first principle component, we can also perform k-means clustering directly on the raw contact matrix. We can color code the points in our two-dimensional PCA plot to show that our k-means clusters represent two distinct groups of bins.
kmeans = KMeans(n_clusters=2, random_state=0).fit(matrix)
plt.scatter(loadings_2[:, 0], loadings_2[:, 1], c=kmeans.labels_, cmap='cool');
Finally, we can overlay the color-coded clusters along the y-axis of the contact heatmap. Once again, we see a good correlation between the plaid pattern and the k-means clusters.
h = ExtendableHeatmap(
array=matrix,
grange_x={'chrom': 'chr1', 'start': 0, 'end': matrix.shape[1] * res},
grange_y={'chrom': 'chr2', 'start': 0, 'end': matrix.shape[0] * res},
colormap='Reds',
colorscale=(0, 1000)
)
cluster_ax = h.add_ax(loc='left', name='cluster')
cluster_ax.imshow(kmeans.labels_[:, None], cmap='cool', aspect='auto',
interpolation='none', extent=[0, 1, matrix.shape[0], 0])
cluster_ax.set_xlim([0, 1])
cluster_ax.set_ylim([matrix.shape[0], 0]);
Comments
Comments powered by Disqus