# Identifying compartments in Hi-C data

Long-range chromatin contact patterns reveal that chromatin is partitioned into distinct "compartments" - contiguous stretches of chromatin on the order of megabases in size whose epigenetic modifications and associated protein factors cause them to be positioned close to similarly-modified compartments in three-dimensional space within the nucleus. In this post, we will take a look at one line of approaches for computationally identifying compartments in Hi-C datasets - one that uses the trans contact profile for each locus in the genome as a feature vector.

We will install some non-standard packages: hic-straw to enable us to get contact matrices from .hic files and lib5c to plot contact heatmaps with extra information aligned to the axes.

In [0]:
!pip -q install hic-straw lib5c

     |████████████████████████████████| 16.2MB 284kB/s
|████████████████████████████████| 1.2MB 38.3MB/s
Building wheel for luigi (setup.py) ... done
Building wheel for powerlaw (setup.py) ... done
Building wheel for interlap (setup.py) ... done

In [0]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import coo_matrix
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

from straw import straw
from lib5c.plotters.extendable.extendable_heatmap import ExtendableHeatmap


First, we will load a balanced trans contact matrix for the interaction between chr1 and chr2 in GM12878 cells from the Rao et al. 2014 dataset at 1 Mb resolution (just to keep this example notebook fast). hic-straw allows us to query a remote .hic file from S3 without downloading the whole thing to the local disk, which is quite convenient. We will immediately send the data we get from hic-straw into a standard scipy sparse matrix.

In [0]:
hic_file = 'https://hicfiles.s3.amazonaws.com/hiseq/gm12878/in-situ/combined.hic'
res = 1000000
row, col, data = map(np.array, straw('KR', hic_file, '1', '2', 'BP', res))
row = row // res
col = col // res
matrix = coo_matrix((data, (row, col))).toarray()

HiC version:  7


We will transpose the matrix so that the rows represent bins on chr2 - in this notebook we will be focusing on assigning bins on chr2 to different clusters that represent compartments.

In [0]:
matrix = matrix.T


There are a few outlier rows we identified that formed single-bin outlier clusters in the clustering steps below - we will set these outlier rows to all zeros for now. If there are a lot of zero rows in the matrix, it's possible that they may form their own cluster - a more robust approach would be to drop the zero and outlier rows to ensure they don't influence the clustering.

In [0]:
matrix[[89, 90, 91, 92], :] = 0


We can visualize the contact heatmap to see the classic "plaid" long range interaction pattern characteristic of compartments.

In [0]:
plt.imshow(matrix, cmap='Reds', vmin=0, vmax=1000)

Out[0]:
<matplotlib.image.AxesImage at 0x7f953a147f60>

We can z-score the matrix and perform PCA on the z-scores, keeping only the first principle component. The projection onto the first principle component (which gets stored in loadings) can be used as a continuous "compartment score".

In [0]:
scaler = StandardScaler()

In [0]:
matrix.shape, loadings.shape

Out[0]:
((244, 250), (244, 1))

We can visualize the compartment score along the y-axis of the heatmap. It correlates with the plaid pattern: one compartment has large positive compartment scores while the other has large negative compartment scores.

In [0]:
h = ExtendableHeatmap(
array=matrix,
grange_x={'chrom': 'chr1', 'start': 0, 'end': matrix.shape[1] * res},
grange_y={'chrom': 'chr2', 'start': 0, 'end': matrix.shape[0] * res},
colormap='Reds',
colorscale=(0, 1000)
)
pc1_ax.set_xlim([-10, 10])
pc1_ax.set_ylim([matrix.shape[0], 0])

Out[0]:
(244.0, 0.0)

We can re-do the PCA, keeping two principle components this time.

In [0]:
loadings_2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(matrix))


This gives us a two-dimensional embedding of each bin on chr2.

In [0]:
plt.scatter(loadings_2[:, 0], loadings_2[:, 1])

Out[0]:
<matplotlib.collections.PathCollection at 0x7f953a1e4f60>

As an alternative to the continuous compartment score based on the projection onto the first principle component, we can also perform k-means clustering directly on the raw contact matrix. We can color code the points in our two-dimensional PCA plot to show that our k-means clusters represent two distinct groups of bins.

In [0]:
kmeans = KMeans(n_clusters=2, random_state=0).fit(matrix)


Finally, we can overlay the color-coded clusters along the y-axis of the contact heatmap. Once again, we see a good correlation between the plaid pattern and the k-means clusters.

In [0]:
h = ExtendableHeatmap(
array=matrix,
grange_x={'chrom': 'chr1', 'start': 0, 'end': matrix.shape[1] * res},
grange_y={'chrom': 'chr2', 'start': 0, 'end': matrix.shape[0] * res},
colormap='Reds',
colorscale=(0, 1000)
)