<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>gilgi.org (Posts about notebook)</title><link>https://gilgi.org/</link><description></description><atom:link href="https://gilgi.org/categories/notebook.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2020 &lt;a href="mailto:site@gilgi.org"&gt;gilgi&lt;/a&gt; </copyright><lastBuildDate>Sat, 23 May 2020 20:36:38 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Identifying compartments in Hi-C data</title><link>https://gilgi.org/blog/compartments/</link><dc:creator>gilgi</dc:creator><description>&lt;div class="cell border-box-sizing text_cell rendered"&gt;&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;&lt;img src="https://gilgi.org/images/blog/compartments.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;Long-range chromatin contact patterns reveal that chromatin is partitioned into distinct "&lt;a href="https://pubmed.ncbi.nlm.nih.gov/19815776/"&gt;compartments&lt;/a&gt;" - contiguous stretches of chromatin on the order of megabases in size whose epigenetic modifications and associated protein factors cause them to be positioned close to similarly-modified compartments in three-dimensional space within the nucleus. In this post, we will take a look at one line of approaches for computationally identifying compartments in Hi-C datasets - one that uses the &lt;em&gt;trans&lt;/em&gt; contact profile for each locus in the genome as a feature vector.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://colab.research.google.com/github/gilgi/gilgi.github.com/blob/src/posts/compartments.ipynb"&gt;&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gilgi.org/blog/compartments/"&gt;Read more…&lt;/a&gt; (5 min remaining to read)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description><category>compartments</category><category>genomics</category><category>Hi-C</category><category>k-means</category><category>notebook</category><category>PCA</category><category>Python</category><guid>https://gilgi.org/blog/compartments/</guid><pubDate>Mon, 13 Apr 2020 04:00:00 GMT</pubDate></item><item><title>Optimal BH-FDR control</title><link>https://gilgi.org/blog/optimal-bh-fdr-control/</link><dc:creator>gilgi</dc:creator><description>&lt;div class="cell border-box-sizing text_cell rendered"&gt;&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;&lt;img src="https://gilgi.org/images/blog/optimal-bh-fdr-control.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;Controlling the &lt;a href="https://en.wikipedia.org/wiki/False_discovery_rate"&gt;false discovery rate (FDR)&lt;/a&gt; has become a ubiquitous step in statistical analyses that test multiple hypotheses, like those commonly performed in the genomics field. The seminal method for controlling FDR, the &lt;a href="https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini%E2%80%93Hochberg_procedure"&gt;Benjamini-Hochberg (BH-FDR) procedure&lt;/a&gt;, guarantees FDR control under some simple assumptions, but how does it really perform in the limit of an ideal dataset and ideal statistical test? In this post, we'll dive into this question and reveal that the proportion of truly non-null data points has a major influence on the behavior of BH-FDR.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://colab.research.google.com/github/gilgi/gilgi.github.com/blob/src/posts/optimal_bh_fdr_control.ipynb"&gt;&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gilgi.org/blog/optimal-bh-fdr-control/"&gt;Read more…&lt;/a&gt; (6 min remaining to read)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description><category>genomics</category><category>notebook</category><category>Python</category><category>statistics</category><guid>https://gilgi.org/blog/optimal-bh-fdr-control/</guid><pubDate>Fri, 15 Nov 2019 05:00:00 GMT</pubDate></item><item><title>First eigenvector vs principle component of a symmetric matrix </title><link>https://gilgi.org/blog/eigenvector/</link><dc:creator>gilgi</dc:creator><description>&lt;div class="cell border-box-sizing text_cell rendered"&gt;&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;Two concepts that are easy to confuse are &lt;a href="https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors"&gt;eigenvectors&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Principal_component_analysis"&gt;principal components&lt;/a&gt;. When the matrix in question is symmetric, there is a relationship between the first eigenvector and the projection of the data onto its first principal component. In this post, we'll use &lt;a href="https://en.wikipedia.org/wiki/Diagonalizable_matrix#Diagonalization"&gt;diagonalization&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Singular_value_decomposition"&gt;singular value decomposition&lt;/a&gt; to try to shed some light on this.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://colab.research.google.com/github/gilgi/gilgi.github.com/blob/src/posts/eigenvector.ipynb"&gt;&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gilgi.org/blog/eigenvector/"&gt;Read more…&lt;/a&gt; (4 min remaining to read)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description><category>linear algebra</category><category>notebook</category><category>Python</category><guid>https://gilgi.org/blog/eigenvector/</guid><pubDate>Sun, 13 Oct 2019 04:00:00 GMT</pubDate></item><item><title>Dota 2 friend network visualization</title><link>https://gilgi.org/blog/dotafriends/</link><dc:creator>gilgi</dc:creator><description>&lt;div class="cell border-box-sizing text_cell rendered"&gt;&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;&lt;img src="https://gilgi.org/images/blog/dotafriends.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;What better way is there to enjoy games than with friends? Of course, some friends have better synergies than others. What if there were a way to quantify those synergies and draw a group of friends as a graph, with high-synergy friends placed close together and low-synergy friends placed farther apart?&lt;/p&gt;
&lt;p&gt;In this post, we'll use the &lt;a href="https://docs.opendota.com/"&gt;OpenDota API&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Multidimensional_scaling"&gt;multidimensional scaling (MDS)&lt;/a&gt;, and &lt;a href="https://networkx.github.io/"&gt;networkx&lt;/a&gt; to visualize a network of friends playing Dota 2 together.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://colab.research.google.com/github/gilgi/gilgi.github.com/blob/src/posts/dotafriends.ipynb"&gt;&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gilgi.org/blog/dotafriends/"&gt;Read more…&lt;/a&gt; (7 min remaining to read)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description><category>API</category><category>Dota 2</category><category>graph</category><category>MDS</category><category>notebook</category><category>Python</category><category>visualization</category><guid>https://gilgi.org/blog/dotafriends/</guid><pubDate>Wed, 14 Aug 2019 04:00:00 GMT</pubDate></item><item><title>Estimation variance of the Poisson rate parameter</title><link>https://gilgi.org/blog/poisson-rate-estimation-variance/</link><dc:creator>gilgi</dc:creator><description>&lt;div class="cell border-box-sizing text_cell rendered"&gt;&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;&lt;img src="https://gilgi.org/images/blog/poisson-rate-estimation-variance.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;Many applications of statistics to genomics involve modelling read counts from &lt;a href="https://en.wikipedia.org/wiki/DNA_sequencing#High-throughput_methods"&gt;high-throughput sequencing experiments&lt;/a&gt; as originating from a &lt;a href="https://en.wikipedia.org/wiki/Poisson_point_process#Interpreted_as_a_counting_process"&gt;Poisson process&lt;/a&gt; (or as something similar based on the closely-related &lt;a href="https://en.wikipedia.org/wiki/Negative_binomial_distribution"&gt;negative binomial distribution&lt;/a&gt;). In these experiments, we often want to estimate the true concentration $\mu$ of a particular DNA or RNA sequence in a sequencing library. This corresponds to the "rate" parameter of a Poisson process, while the total number of reads sequenced can be thought of as related to its "elapsed time". How does the sequencing depth of the dataset (the total number of reads) impact the variance in the estimate of the true concentration? Read on to find out!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://colab.research.google.com/github/gilgi/gilgi.github.com/blob/src/posts/poisson_rate_estimation_variance.ipynb"&gt;&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gilgi.org/blog/poisson-rate-estimation-variance/"&gt;Read more…&lt;/a&gt; (3 min remaining to read)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description><category>genomics</category><category>notebook</category><category>Python</category><category>statistics</category><guid>https://gilgi.org/blog/poisson-rate-estimation-variance/</guid><pubDate>Tue, 13 Aug 2019 04:00:00 GMT</pubDate></item><item><title>Partially solving the "knight dialer" problem using graph exploration</title><link>https://gilgi.org/blog/knight-dialer/</link><dc:creator>gilgi</dc:creator><description>&lt;div class="cell border-box-sizing text_cell rendered"&gt;&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;&lt;img src="https://gilgi.org/images/blog/knight-dialer.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;This week, a friend showed me the &lt;a href="https://leetcode.com/problems/knight-dialer/"&gt;knight dialer coding problem&lt;/a&gt;. As usual, I couldn't resist framing it as a graph exploration problem - read on to see how I tackled it.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://colab.research.google.com/github/gilgi/gilgi.github.com/blob/src/posts/knight_dialer.ipynb"&gt;&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gilgi.org/blog/knight-dialer/"&gt;Read more…&lt;/a&gt; (5 min remaining to read)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description><category>coding problem</category><category>graph</category><category>notebook</category><category>Python</category><guid>https://gilgi.org/blog/knight-dialer/</guid><pubDate>Thu, 20 Jun 2019 04:00:00 GMT</pubDate></item><item><title>Visualizing trajectories of Dota Auto Chess games using LDA and TSNE</title><link>https://gilgi.org/blog/autochess/</link><dc:creator>gilgi</dc:creator><description>&lt;div class="cell border-box-sizing text_cell rendered"&gt;&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;&lt;img src="https://gilgi.org/images/blog/autochess.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;The Dota 2 custom map "Dota Auto Chess" is taking the gaming world by storm. Auto Chess is almost like a card game in which players attempt to improve their hand round after round while their "chesses" (cards) battle the other players in the game automatically. In this post, we'll explore the potential of &lt;a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation"&gt;latent Dirichlet allocation&lt;/a&gt; to model strategies (specific combinations of cards that are often played together) as "topics". We will then try to visualize the evolution of Auto Chess "hands" over the course of individual games in terms of their topics using &lt;a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding"&gt;t-SNE&lt;/a&gt;. These kinds of visualizations might help us learn how to more effectively evolve our strategies.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://colab.research.google.com/github/gilgi/gilgi.github.com/blob/src/posts/autochess.ipynb"&gt;&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gilgi.org/blog/autochess/"&gt;Read more…&lt;/a&gt; (8 min remaining to read)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description><category>Dota 2</category><category>LDA</category><category>notebook</category><category>Python</category><category>TSNE</category><category>visualization</category><guid>https://gilgi.org/blog/autochess/</guid><pubDate>Sat, 18 May 2019 04:00:00 GMT</pubDate></item><item><title>MLE estimation of mean parameter for scaled distributions</title><link>https://gilgi.org/blog/mle-scaled-mean/</link><dc:creator>gilgi</dc:creator><description>&lt;div class="cell border-box-sizing text_cell rendered"&gt;&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;&lt;img src="https://gilgi.org/images/blog/mle-scaled-mean.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;It's pretty common to run into statistical models that fit some kind of normalization factors that can be used to scale different data points to a common, comparable scale. When this happens to a &lt;a href="https://en.wikipedia.org/wiki/Location%E2%80%93scale_family"&gt;location-scale family distribution&lt;/a&gt; like the normal distribution, estimating the mean parameter of such a scaled distribution is pretty straightforward. In many genomics contexts, however, non-location-scale distributions such as the &lt;a href="https://en.wikipedia.org/wiki/Poisson_distribution"&gt;Poisson&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Negative_binomial_distribution"&gt;negative binomial (NB)&lt;/a&gt; distribution are used quite often. For example, it's common to model RNA-seq read counts using NB distributions scaled by the library size (i.e., some measure of the total sequencing depth). In this post, we'll take a deep dive into how to estimate the parameters of these "scaled" statistical models.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://colab.research.google.com/github/gilgi/gilgi.github.com/blob/src/posts/mle_scaled_mean.ipynb"&gt;&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gilgi.org/blog/mle-scaled-mean/"&gt;Read more…&lt;/a&gt; (21 min remaining to read)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description><category>genomics</category><category>notebook</category><category>Python</category><category>statistics</category><guid>https://gilgi.org/blog/mle-scaled-mean/</guid><pubDate>Wed, 08 May 2019 04:00:00 GMT</pubDate></item><item><title>Simulating maps of 3D genome architecture</title><link>https://gilgi.org/blog/heatmap-simulation/</link><dc:creator>gilgi</dc:creator><description>&lt;div class="cell border-box-sizing text_cell rendered"&gt;&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;&lt;img src="https://gilgi.org/images/blog/heatmap-simulation.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gilgi.org/research"&gt;My research&lt;/a&gt; focuses on analyzing maps of the 3D structure of the genome. One of the things I always find helpful when trying to understand complex data like this is trying to simulate it myself to get a sense for how the data might behave. In this post, we'll add complexity step-by-step and work our way up to simulating realistic-looking genome folding maps from scratch!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://colab.research.google.com/github/gilgi/gilgi.github.com/blob/src/posts/heatmap_simulation.ipynb"&gt;&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gilgi.org/blog/heatmap-simulation/"&gt;Read more…&lt;/a&gt; (8 min remaining to read)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description><category>3d genomes</category><category>genomics</category><category>notebook</category><category>Python</category><category>statistics</category><guid>https://gilgi.org/blog/heatmap-simulation/</guid><pubDate>Mon, 29 Apr 2019 04:00:00 GMT</pubDate></item><item><title>Linear regression in Python (UPenn ENM 375 guest lecture)</title><link>https://gilgi.org/blog/linear-regression/</link><dc:creator>gilgi</dc:creator><description>&lt;div class="cell border-box-sizing text_cell rendered"&gt;&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;&lt;img src="https://gilgi.org/images/blog/linear-regression.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;I was recently invited to give a guest lecture in the course &lt;a href="https://catalog.upenn.edu/courses/enm/"&gt;ENM 375 Biological Data Science I - Fundamentals of Biostatistics&lt;/a&gt; at the University of Pennsylvania on the topic of &lt;a href="https://en.wikipedia.org/wiki/Linear_regression"&gt;linear regression&lt;/a&gt; in Python. As part of my lecture, I walked through this notebook. It might serve as a useful reference, covering everything from simulation and fitting to a wide variety of diagnostics. The walkthrough includes explanations of how to do everything in vanilla &lt;a href="https://numpy.org/"&gt;&lt;code&gt;numpy&lt;/code&gt;&lt;/a&gt;/&lt;a href="https://www.scipy.org/"&gt;&lt;code&gt;scipy&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://scikit-learn.org/"&gt;&lt;code&gt;scikit-learn&lt;/code&gt;&lt;/a&gt;, and &lt;a href="https://www.statsmodels.org/"&gt;&lt;code&gt;statsmodels&lt;/code&gt;&lt;/a&gt;. As a bonus, there's even a section on &lt;a href="https://en.wikipedia.org/wiki/Logistic_regression"&gt;logistic regression&lt;/a&gt; at the end.&lt;/p&gt;
&lt;p&gt;Read on for more!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://colab.research.google.com/github/gilgi/gilgi.github.com/blob/src/posts/linear_regression.ipynb"&gt;&lt;img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gilgi.org/blog/linear-regression/"&gt;Read more…&lt;/a&gt; (19 min remaining to read)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description><category>machine learning</category><category>notebook</category><category>Python</category><category>statistics</category><guid>https://gilgi.org/blog/linear-regression/</guid><pubDate>Tue, 16 Apr 2019 04:00:00 GMT</pubDate></item></channel></rss>