Since the lockdowns started in March, Zoom meetings have become an intgral part of of a typical workday. Like most people, I was just using my AirPods Pro, until I dropped one of them and something went wrong with the noise cancellation. They still work, but when I speak I hear myself echoed in the right earbud, which ended up being very annoying. I wanted to get a Blue Yeti USB microphone to replace them, but at this point they were all sold out. I figured a small simple solution would be to get a wireless lavalier mic. This is pretty much as unobtrusive as the AirPods Pro.

Unfortunately, I discovered the lavalier mic sounded terrible. Maybe because I got a pretty cheap one, but when using the "Test Mic" feature in Zoom with it I figured I wouldn't want to bother anyone with such bad audio. Instead I got a popular condensor microphone and an audio interface, the next best thing after a Blue Yeti USB microphone (which were still sold out).

When testing the condensor microphone in Zoom's "Test Mic" feature I was very disappointed, it sounded terrible. I tried recording the mic in Quicktime, and it sounded much better!


As a demonstration, here is a recording using a shotgun mic recorded directly from the source, and a captured recording of what the "Test Mic" feature sounds like in Zoom with the same microphone.

What I heard from the "Test Mic" feature sounded much worse than anyone had ever sounded in a Zoom meeting for me. I was wondering if my audio actually sounded that bad on the end of a receiver. To test this, I set up a Zoom meeting between my laptop and my desktop, and recorded microphone tests as they sounded on the receiving laptop transmitted from the desktop with the microphone plugged in. It turns out the "Test Mic" feature is nowhere near representative of the actual sound of the microphone over Zoom.

While I had this set up, I recorded several different options for microphones for the sake of a comparison. I also recorded local native microphone samples to learn in what ways Zoom distorts audio.

There are a number of options in Zoom to disable various audio filters, but I couldn't understand what they did, so I left them all on the defaults.


Here is the collection of all microphones I tested with this setup. Most interesting is probably the right hand column of audio clips that demonstrates what the different microphones sounds like over a Zoom meeting.

关于安卓手机连接电脑的那些事儿(干货分享)_安卓玩机_菲菲博客:2021年12月18日 - 当我伔手机连接到电脑时(手机开启USB调试),不管你用的什么手机管理助手,其实都是...可众在线免费查看Android源伕码歪果仁的网站,访问速度超快,关键是...

手机CPU性能排行怎么看 手机CPU天梯图2021年10月最新版_电脑百事网:2021年10月3日 - 在刚刚过去的九月份,目前手机CPU厂商中,主要新增的是苹果A12仿生处理器,另外国内华为也在9月份正式发布了麒麟980处理器。而就最近的表现来看,依旧是...

I was also surprised to learn the AirPods Pro sounds pretty bad over Zoom too. A common first tip for good Zoom meeting audio is to use any headset instead of the microphone built into your computer or webcam. But at least to my ears the built in microphones in my iMac, my webcam, and my laptop all sound better than than the AirPods Pro. This could be another symptom of them being slighly broken.

Taking all the recorded microphones, I ranked them by listening to them pairwise and deciding which one I thought sounded better. I also thought about which options were more convenient, using subjective considerations such as whether they are wired, comfortable, mobile, taking up space, etc. The figure below summarizes how I would rank the microphone options along both axes.

Screen Shot 2024-07-03 at 18.55.10.png

Zoom meetings are exhaustng, to the point where I have no interest what so ever to attend any of the upcoming remote versions of conferences. I feel very bad for students who have to endure remote school days and lectures. There are many reasons Zoom meetings are tiring, but I belieave a major reason is the poor audio quality from the people speaking.

This comparison helped me decide between my options. I wish wireless options were better so I could more easily take meetings away from my work desk to get some more variety into my workday.

/Valentine Svensson


Slide-seq is a promising technology for the study of cellular tissue anatomy.It uses a clever strategy to randomly deposit barcoded RNA-capture beads on slides. To record the location of the beads the barcode is read off using in situ sequencing. Following this, a slice of tissue can be deposited on the slide, where the beads can capture RNA from the cells of the tissue. After this step the beads can be treated as from a Drop-seq experiment, and with high throughput sequencing one can arrive at a count matrix of beads x genes, with the paired information of spatial coordinates of the beads.

Slide-seq was published a year ago, and the dataset is still the largest published in terms of number of observations with 2,522,640 beads. The data ended up being difficult to analyze however, due to extremely low counts per bead. Even with very large numbers of observations it is difficult to learn structure in count data of counts are small.

Recently, an update called ‘Slide-seqV2’ was published. In the paper they describe testing Slide-seq and Slide-seq2 on serial sections of brain tissue and find a ten-fold improvement in average counts.

This is on newly generated data, and I was curious how the previous data compares. Relative improvement within an experiment tells you about how much better the protocol is, but what I wanted to know was how much easier this experiment would be to analyze. For this sequencing depth needs to be considered, which may not necessarily be matched between the old and the new data.

After downloading the data made public by the authors, I converted it, combined it with the old Slide-seq data, and made a plot comparing the number of counts per bead between the protocols. (I also live streamed this process, available here)


手机谷歌浏览器用不了_谷歌梯子 - 电影天堂:2021-9-18 · 手机谷歌浏览器用不了_谷歌梯子,电影天堂为您提供手机谷歌浏览器用不了_谷歌梯子迅雷下载和剧情,手机谷歌浏览器用不了_谷歌梯子简介:为什么我的谷歌商店和谷歌浏览器都用不起了?480x854 - 47KB - PNG为什么我的谷歌商店和谷歌浏览器都用 ...

Preferably you would have at least a few thousand UMI’s per observation when doing analysis. This is a step in the right direction, but it still might require the supervised style analysis presented in both the papers, which make use of additional scRNA-seq data as a guide.

The analysis is available on pc端电脑梯子, and I have made available H5AD files containing the Slide-seq here and Slide-seq2 data here.

/Valentine Svensson


Recently Google 苹果手机能用的梯子 aggregated anonymized data on community mobility for 131 countries based on cell phone location data. The data spanned the time period between February 16 and March 29, and made clear how people have limited visits to places categorized as “Retail & recreation”.

The data was made available as line graphs in PDF reports per country, meaning you need to open each country you want to look at in a separate document. I wanted to see how population behaviour had changed globally by looking at many countries at the same time. But the data used to make the curves in the reports was not available for direct download.

In the PDF’s the curves are vector graphics, so in theory the underlying data is embedded within the PDF. Using Illustrator I could access the curve element, but I couldn’t see a way to extract the data that builds the curve. I tried finding tools explicitly made for extracting data from PDF’s and encountered a tool called “norma”, but I couldn’t get it to work.

The line graphs are pretty clear and have simple colors, so I thought it would be pretty simple to extract the values even if I had the bitmap images of the graphs.

I used Greenshot to take screenshots of the regions in the reports in the range -80% to 80%, which are the labeled extremes on the y-axis, while covering the entire x-axis.


After loading a screenshot in Python I could isolate the pixels belonging to the curve by using a threshold on the red channel.


These pixel values together with the knowledge of the axis scales can be used to recover the original data.

# Binarize the image
mask = 0.5 < img[:, :, 0]

# The `.argmin(0)` method will return the first index where
# we see a black pixel along the y-axis in the image.
# Images have negative y-axis, flip to get positive values.
pixel_value = -mask.argmin(0)

# We know the y-axis spans the interval [-80, 80]. To get the
# original values, scale values to the fraction of vertical pixels
# the observed values are at. Then scale these fractions to the
# [-80, 80] interval.
percent = 160 * (pixel_value + img.shape[0]) / img.shape[0] - 80

# The reported dates start on February 16 and end on March 29.
# Make a DateRangeIndex defined for a single day's resolution.
dates = pd.date_range(, 2, 16),, 3, 29))

daily_percent = (
        # Use the `cut()` function to bin horizontal pixels into
        # intervals corresponding to the days.
        'bins': pd.cut(np.arange(percent.shape[0]), bins=dates.shape[0]),
        'percent': percent
    # Group the percent values by the day bins.
    # Extract the median percent value for each bin.

# Finally, add the dates to the data
daily_percent['date'] = dates

Now we have the percentage values for each day in a DataFrame and can re-plot it to recreate the reported curve.

2021最新翻墙方法-iphone苹果手机ios可众用的翻墙梯子VPN ...:2021-4-14 · 2021最新电脑ios上好用的加速器安卓iphone苹果手机翻墙vpn梯子app:ExpressVPN评测报告-最好用的翻 墙VPN推荐 iPhone苹果手机翻墙软件与教程 2021最新电脑ios上好用的加速器安卓iphone苹果手机翻墙vpn梯子app:PureVPN深度评测-最好用的翻墙VPN ...

Having validated that it was possible to recover the data given the screenshots, I spent some time creating screenshots for 47 countries. With these I could simply loop over the images, create the DataFrames, and combine them at the end. With the combined data I was able to make the combined plot I wanted:


The notebook I used to do this, as well as a CSV of the resulting DataFrame, are available on Github.

/Valentine Svensson


The war photographer Abbas advised people who wanted to improve their photography to “Get a good pair of walking shoes… and fall in love.”

Photography is my favourite and most serious pastime, and my main creative outlet. I have been interested in photography since 2002 when I borrowed a friends early digital point-and-shoot camera and by some accidental fluke managed to take an otherworldly photo of the main street in my hometown. A few months later I got my first digital camera, started taking pictures, and haven’t really stopped since. My tastes in photos have changed over time many times, as has digital photography itself as it has become an ubiquitous part of culture. For the last 10 years I have tried to actively think about how to improve and consider photography as an art form.

One thing has not changed since I started in the early 2000’s: my preferred mode of taking pictures has always been to walk for hours upon hours, and shoot what I encounter. In cities, through forests, over mountains. The advice from Abbas has a profound resonance with me.

A recent change in many people’s life is the introduction of ever-present personal tracking systems, enabled through increasingly cheap advanced sensors in electronic devices such as cell phones. The most recent phone I bought a couple of years ago came with the ability to passively count the steps of its owner throughout the days. Reflecting on this on yet another day of walking with my camera the other week I realized that I could put Abbas advice to the test: what is the effect of walking on my photographic productivity?

The number of steps I have taken in each day for the last year and a half was easily extracted from my phone. From my photo management software I could parse out the number of photos I had captured in each of the days.


2021最新翻墙方法-iphone苹果手机ios可众用的翻墙梯子VPN ...:2021-4-14 · 2021最新电脑ios上好用的加速器安卓iphone苹果手机翻墙vpn梯子app:ExpressVPN评测报告-最好用的翻 墙VPN推荐 iPhone苹果手机翻墙软件与教程 2021最新电脑ios上好用的加速器安卓iphone苹果手机翻墙vpn梯子app:PureVPN深度评测-最好用的翻墙VPN ...

To answer this, I set up a generalized linear regression model with a Poisson likelihood using brms, predicting the number of photos taken as a response to the number of steps I had taken. After fitting the model I investigated the marginal effect and found that on average I capture about 50 additional photos for every 10,000 additional steps I take in a day.


华为手机怎么安装WHATSAPP(页 1) - 电脑网络 - 福步外贸 ...:2021-9-28 · 大家好, 本人刚买了华为手机, 发现不论怎么安装都安装不了whatsapp. 做外贸的手机不能安装whatsapp 等于没有手机一样。 有会安装的麻烦留言指教下,万分感谢 福步外贸论坛(FOB Business Forum) 福步外贸论坛是中国最大的专业外贸论坛 ...

An R notebook with the analysis for this post is available on Github.

/Valentine Svensson


Typical single cell RNA-seq studies are still working in an exploratory sense, aiming to characterize the diversity in a cell population by identifying discrete cell types or trajectories of variation. But there are also efforts trying to identify how different experimental or observed conditions affect different cell types. For example, by sorting blood cells from healthy and diseased people, scRNA-seq could in principle be used to assign cell types to the cells, and differential expression for each cell type could be investigated to see if the disease is affecting any particular cell type more than others.

The task of differential expression in essence boils down to creating a model of the form \( \text{expression} = b \cdot \text{[is diseased]} + a \) and investigating whether the \( \text{[is diseased]} \) label helps predicting the gene expression level.

In such a procedure, each individual contributes many cells of each cell type. So in total, the number of observations (cells) might be in the thousands. The number of individuals contributing the cells might be much smaller, and this raises the question of how to think about statistical power for the test. It is important that the between-individual variation is not larger than the between-conditions variation.

When comparing two assigned cell types against each other it is straightforward to account for between-individual variation. Since (ideally) each individual contributes with samples of each cell type, an individual-specific effect can be added to the model. In other words it means writing a model of the style \( \text{expression} = b \cdot \text{[is cell type IV]} + \sum_{i} c_i \cdot \text{[is mouse i]} + a \).

If the goal is to compare gene expression between healthy and diseased mice for a particular cell type, it is no longer possible to separate the individual effect from the disease effect. A model would be \( \text{expression} = b \cdot \text{[is diseased]} + \sum_{i} c_i \cdot \text{[is mouse i]} + a \), but whether the disease effect goes into the \( b \) parameter or the \( c \) parameter is ambiguous. There is complete confounding between condition and individual. If you compare this model to \( \text{expression} = \sum_{i} c_i \cdot \text{[is mouse i]} + a \) you will not be able to tell a difference. We would for example never have data from the same individual being both healthy and diseased.

designArtboard 1@4x.png

This means that when accounting for individual-specific effects on gene expression, it will never be possible to get low p-values. Not without making the assumption that the individuals are directly comparable, and we do not need to consider individual-level effects. This might be reasonable for lab mice, but when analyzing cells from human donors, there will be a lot of genetic and lifestyle variation between individuals.

A strategy to deal with this is use multilevel regression rather than linear regression. A simple way to think about these models is as means at different levels, with level-specific variance that observations vary around. Say there is an overall mean expression level, and a between-mouse variance. Then for each mouse there is a mean expression level, with between-cell variance for that particular mouse.

Such a model can be compared to a model where there is an overall mean, with condition-specific variance. Then each condition (healthy vs diseased) has a mean, as well as between-mouse variance for the conditions. And in turn, each mouse, within a condition, has a specific mean with between-cell variance. If grouping the mice by disease give a better model fit for the gene expression values, then this is evidence of the gene being perturbed by the disease state. While also accounting for between-individual variation!

multilevelArtboard 1@4x.png

To test this approach to differential expression, I looked at a dataset published by Hrvatin et al 2018 [1]. They perform scRNA-seq on cells from the visual cortex of mice. These are mice that have been living in complete darkness for several days. Some of the mice are exposed to light for an hour before they are sacrificed and cells are collected. The concept is that now you can identify which cell types in the visual cortex get activated by the exposure to light by investigating the differential expression of no light vs 1 hour light exposure.

手机怎么挂香港梯子,电脑硬件基本常识请教一个电脑常识电脑是谁...:2021年4月28日 - 手机怎么挂香港梯子,电脑硬件基本常识请教一个电脑常识电脑是谁发明的。...手机怎么挂香港梯子,最好有冰封系统或。内存的问题,你可众上绿色兵团?或者...

I used three different models for testing differential expression. An ordinary linear regression model on log(counts + 1) expression values using the R lm function; a Poisson generalized linear model on observed counts using the R glm function; and a Poisson multilevel model on the observed counts using the glmer function in the lme4 R package. In formula notation, these would be the models:

LM H0: ‘log(expr + 1) ~ log(total_count)’
LM H1: ‘log(expr + 1) ~ log(total_count) + stim’

GLM H0: ‘expr ~ offset(log(total_count))’
GLM H1: ‘expr ~ offset(log(total_count)) + stim’

LME H0: ‘expr ~ offset(log(total_count)) + (1 | sample)’
LME H1: ‘expr ~ offset(log(total_count)) + (1 | stim/sample)’

To obtain p-values, I performed likelihood ratio tests between the H0 and H1 models. Note that the multilevel model is the only one aware of sample (mouse) grouping!

In the results I am highlighting genes that Hrvatin et al also point out as particularly differential between no light and 1h light.


Ordinary regression clearly identifies the genes discussed by Hrvatin et al. But the p-values are extremely small (for reference, the red dashed line indicate p=0.05). The model is not acknowledging the fact that cells from the same mouse probably have a lot of internal correlation, and therefore the model fit is overly certain. The same phenomenon can be observed in the GLM model. The multilevel model has much more reasonable p-values.

An alternative proposed strategy is a two-step one. First, define cell types on the single cell level from scRNA-seq. Then, for each cell type sum up the counts for each biological sample (mouse) and treat them as “pseudobulk”, as if they were sorted cells from each sample. Now we can apply differential expression tests on the summed level. I applied the same models, but for such summed pseudobulks, now having only 17 observations rather than 1,856.


The OLS regression now seems much more reasonable, the GLM model still has the same issue of seeming overly confident, and interestingly, the Poisson multilevel model for the pseudobulk gives exactly the same results as for the single cell data. This idea of pseudobulk analysis was explored by Lun & Marioni 2017 [2].

There’s no particular ground truth here for a proper comparison, but the investigation suggests that it is possible to use multilevel models for differential expression. I like that the multilevel approach ties directly to the hierarchical nature of the experiment. Here I only used basic R functions rather than specialized DE packages. The specialized packages would do more clever things like empirical Bayes shrinkage across the genes, or use more sophisticated offsets than the total_count I use here. I just wanted to compare these strategies directly with naive implementation. (For an interested reader, I suggest the recent comparison by Soneson & Robinson 2018 [3].)

Finally, I used Poisson models here; there are plenty of reasons to think that negative binomial models would be more appropriate, and I might revisit these questions to test that at some point. Notebooks, R scripts, and other material for making this post are available on Github. Thanks to Lior Pachter for editorial feedback on this post.


[1] S. Hrvatin et al., “Single-cell analysis of experience-dependent transcriptomic states in the mouse visual cortex,” Nat. Neurosci., vol. 21, no. 1, pp. 120–129, Jan. 2018.

【图】安卓手机上的梯子,怎么分享给pc用?_1_四川论坛_爱卡汽车:安卓手机上的梯子,怎么分享给pc用?(1/1) 楼主:0z 返回帖子 评论图片 分享 手机上有好使的梯子,没有pc版本 有什么app能把手机上的梯子分享出来给pc用? ...

[3] C. Soneson and M. D. Robinson, “Bias, robustness and scalability in single-cell differential expression analysis,” Nat. Methods, Feb. 2018.

/Valentine Svensson


华为share用不了了-问题已经解决 - 华为P30系列分享交流 ...:2021-5-16 · 华为share显示已经连接上电脑了,但是电脑不弹出手机界面了。昨天还好好的,今天就用不了了。问题已经解决了,原来多屏协同不能用,同时移动热点也不能用,找到投影设置,提示此设备不支持Miracast。进入设备管理器 ... 华为share用不了了-问题已经解决

In single cell RNA-seq, I tend to apply this principle by looking at a lot of published data, to see how the properties of data in general correspond to data that I’m analyzing for the sake of scientific investigation.

Single cell RNA-seq data consists of observed counts of molecules generated from different genes in different cells. Sometimes however, authors only publish counts that have been scaled. One very popular unit is CPM, counts per million, where the counts in each cell is divided by the sum of counts, then multiplied by a million. Recently I have also noticed data with similar scaling to 10,000 or 100,000, or to a median across the data set.

To compare data from different sources, it is easier to interpret differences if you know the per-cell total counts. It allows you to think of the counts in the data in the context of exposure. I recently wanted to look at some 10X data, but it turned out the expression matrix did not consist of counts.

Screen Shot 2018-10-25 at 14.32.47.png

Looking at the mean-variance relation for all the genes in the data it was clear that they had been log-transformed.


Even after exponentiating the expression levels, they were not integers. However, looking at the sum of expression levels in each cell it could easily be seen that the counts had been scaled to 10,000 counts per cell.

Screen Shot 2018-10-25 at 14.36.39.png

One would think that this is a lossy transformation: we wouldn’t know how large the magnitudes in a cell were before division. However, the very problem with scaling of counts means we can recover the value used to scale the cell.

It’s not clear if this is a biological property or a property of the measurement process, but out of the 20,000 or so genes, in scRNA-seq data of any given cell, the vast majority of genes will have 0 counts. Follows by a substantial number of genes with 1 counts, followed by a slightly smaller number of genes with 2 counts, etc.

Since the scaling by division just changes the magnitude of the counts, but not the discrete nature, the most common value will correspond to 0, the second most common value to 1, etc. This plot of a few random cells make this relation clear.


2021年10月最新手机CPU天梯图-电脑配置网:2021年10月30日 - 入门组装电脑配置清单 网吧组装电脑配置清单返回...每一伕手机处理器出来的时候很多用户会充满好奇...打开图片”,即可众查看高清大图,也可众直接下载...

By simply multiplying the cells by the identified scaling factors, the true counts are recovered. After un-scaling, the typical relation between total count and detected genes is clear, and the negative binomial mean-variance relation for genes is apparent.


I wanted to write about this, because the fact that it is possible to recover the original counts after scaling indicates that just scaling the counts is not sufficient for normalization. The properties if discrete counts are still present in the data, and will be a factor in every analysis where these scaled counts are used. It will make interpretation hard, since the relation between magnitude and the probability of generating values magnitude will not be tied to each other as clearly as with counts.

Notebook with this analysis is available here.

/Valentine Svensson


Malaria is one of the world's deadliest diseases with 200 million cases in a year, leading to half a million deaths. It is caused by mosquito-borne Plasmodium parasites. During parasitic infections our adaptive immune system responds through a complex series of cellular interactions, precipitated by events at the molecular level. The dynamic ability to suppress and activate genes allow immune cells to maintain surprising plasticity. Switching the states of the cells in response to infection require a network of gene regulatory events to occur.

At the heart of the adaptive immune system are T helper cells. Under healthy conditions, naive T helper cells are quiescent and roam through the body awaiting signals. Once T helper cells receive activation signals they differentiate into different roles depending on the type of infection. In malaria, naive cells differentiate into two distinct populations with different roles: Th1 cells, which signal macrophages and killer T cells to attack infected cells. And Tfh cells, which activate survival programs and promote antibody affinity in B cells.

Differentiation into a role is orchestrated by a small number of transcription factors regulating expression of large sets of genes. Identifying transcription factors that direct the differentiation fate is of key importance to understand the immune system, and to what degree in vivo cell-extrinsic factors bias the cell fate has not been studied in detail.

翻墙不犯法!|科学上网绝对不能干的事!给搭梯子的你一些安全 ...:2021-6-20 · 放弃VPN吧,目前最稳定的科学上网方式。电脑端SSR使用教程详解,花点耐心看完。翻墙原理众及VPN与SSR优劣比较,不再为翻墙发愁。授人与鱼不如授 ...

跟大家推荐一个好用的梯子软件 - 计算机科学 寄托家园留学论坛:2021-11-23 · 大家都知道,万里长城万里长,长城里面是故乡。有时候在大陆想要使用国际社交软件真的很难,特别是留学生应该感同身受···想上学校的官网,卡到爆炸。放假回家,突然有 ... 跟大家推荐一个好用的梯子软件 ,寄托家园留学论坛

We used Gaussian process methods to identify and investigate genes with expression dynamics over time. These are nonparametric machine learning models that can represent and identify any smooth dynamic function. (Later we extended this method to identify spatially variable genes in spatial tissue gene expression systems (4).)

Unfortunately five time points are not sufficient to make conclusions about the dynamics of gene expression. However, we could exploit the fact that the immune cells are not synchronized in their response. An extension of the Gaussian process model let us estimate latent per-cell “time points” which could be earlier or later than cell collection times. This Gaussian process latent variable model simultaneously model many genes as different smooth functions, and correct time values so genes are consistent between each other. Now we could investigate the temporally dynamic genes and looked at when they started, peaked, and stopped expression. This way we learn the order transcriptional programs like activation or proliferation.

The most crucial aspect of our experiment was that the T helper cells had two different fates. Meaning any gene’s dynamic expression was the result of either of two functions: a Th1 fate function, or a Tfh fate function. We needed a way to analyse gradual commitment to either fate, without knowing which cells were on which path.

We solved this problem using a mixture of overlapping Gaussian processes. A mixture model is a statistical way to formalize the cell fate assignment problem described above. Machine learning algorithms can be used to infer the most likely groupings of cells into the components of the mixture. In our case cells would be assigned to either a “Tfh component”, or a “Th1 component”, with ambiguous assignments for uncommitted cells.

After fitting the mixture model to assign cells to fates, we could investigate how each individual gene related to the fates. This was the final key to uncovering transcriptional programs for the fate choice. This recapitulated the known Tfh markers Bcl6 and Cxcr5, but in particular let us associate the maintenance of Tcf7 with the Tfh fate. The mixture model identified the Tcf7 antagonist Id2 as the main transcription factor associated with the Th1 fate, along with many known Th1 genes.

The peak times of dynamic genes could be linked to when cells started committing to fates. This identified receptors which were compatible with signaling molecules expressed by the monocytes we had collected in parallel. We confirmed that mice with reduced numbers of monocytes produced smaller fractions of Th1 cells. (Mice with reduced number of B cells on the other hand produce smaller fractions of Tfh cells).

It was possible that T helper cells driven to either fate by the communicating cells had already been primed to respond to only these cells. Using the T cell receptor sequence in the T helper cells as a natural clonal barcode, we found that both Th1 and Tfh cells shared sequences, and must have originated from common naive progenitors.

In summary our data indicate that in malaria activated T helper cells start expressing Tcf7. After proliferation rate acceleration the cells express receptors which receive signals from monocytes or B cells. Cells receiving B cell signals maintain Tcf7 expression and become Tfh cells, while cells receiving monocyte signals replace Tcf7 expression with Id2 expression and become Th1 cells.

Biology is entering a new era with the rich measurements from single cell RNA sequencing. It allows us to use machine learning to ask explicit questions about the data. The power is particularly clear from the joint cellular and molecular perspective. Going forward I hope for more developments that allow us to ask and answer questions from this perspective, and more opportunities for methods like the overlapping mixture of Gaussian processes that can link the two.

References and notes

  1. pc楼梯,pc楼梯知识 - 电气工程 - 造价通:7天前 - 答:都可众在绘制输入中定义绘制上去的 用普通的梁...楼梯井就是双跑楼梯的梯板之间的空挡。查看详情 ...手机版访问:pc楼梯 相关话题 路桥市政 成本...
  2. V. Svensson et al., Power analysis of single-cell RNA-sequencing experiments. Nat. Methods (2017), doi:10.1038/nmeth.4220.
  3. T. Lönnberg et al., Single-cell RNA-seq and computational analysis using temporal mixture modeling resolves TH1/TFH fate bifurcation in malaria. Science Immunology. 2, eaal2192 (2017).
  4. V. Svensson, S. A. Teichmann, O. Stegle, SpatialDE: identification of spatially variable genes. Nat. Methods. 15, 343–346 (2018).
  5. I. C. Macaulay et al., Single-Cell RNA-Sequencing Reveals a Continuous Spectrum of Differentiation in Hematopoietic Cells. Cell Rep. 14, 966–977 (2016).

This post was originally an essay I submitted for a competition on work done in the field of genomics during PhD. Thanks to Sarah Teichmann, Lynn Yi, and Eduardo Beltrame for feedback on the text.

/Valentine Svensson


In the last few months a number of interesting brain single cell datasets have been published. The new promising STARmap method was pusblished by Wang et al in Science using 求稳定的梯子 as example applications. Even more recently, two back to back mouse brain atlases were published in Cell. Zeisel et al characeterized the nervous system of adolescent mice through sequencing over 500,000 cells from 19 different regions using the 10X Genomics Chromium technology. Saunders et al studied adult mouse brain using almost 700,000 cells from 9 regions of the brain using the Drop-seq method.

All three papers contain some data from the frontal cortex region. An interesting exercise is to see how these datasets relate to one another, and if they can be analysed as a single entity.

The Wang et al STARmap data measures a panel of 166 genes in the medial prefrontal cortex (mPFC), across 3,700 cells from three separate biological samples. The STARmap data does not have any clusters annotated, but they do have spatial locations in the tissue preserved! The data from Saunders et al and Zeisel et al are transcriptome wide, so much larger. But it would be interesting to see how well the small panel of STARmap genes would work to integrate and analyse these datasets.

The anterior cortex data from Zeisel et al consists of 14,000 cells with annotations for 53 clusters (though some of these seem to be enteric neurons which is a bit odd?). The frontal cortex dataset from Saunders et al has over 70,000 cells annotated with 62 clusters. For the sake of simplicity, here I randomly sampled 15,000 cells from the Saunders data.

When the datasets are combined we get an expression matrix with 32,000 cells with counts in 158 genes. To analyze the data together I use scVI an autoencoder based method that have been working well in my current research work. Technology specific effects can be accounted for by providing them as 手机电脑都能用的梯子.

Because we are looking at very few genes here, running inference on the scVI model takes about 10 minutes for 100 epochs. The scVI model then gives you access to a low-dimensional (10 dimensional in this case) latent representation space which can be used to generate counts for all genes consistent with the structure in the original data. The main point with this is that many machine learning and statistical methods that have problems with count data

Once the scVI model is fitted, it is a good idea to create a tSNE of the latent space, and color the cells by various known factors to try to figure out what might contribute to variation in the data.


We see that similar patterns of densities overlap for all the datasets. Of course we need to remind ourselves not to read too much into the tSNE representation; the important part is the 10-dimensional latent representation of the data.

We can note that on the larger annotated scale annotation from the Zeisel and Saunders we can see overlap between related terms in the tSNE! Immune cells close to immune cells, neurons close to neurons, etc.


We can also color the cells by the complete cluster annoation. But with so many clusters it becomes hard to tell the colors apart! (And here I even collapsed some of the Saunders et al cluster names so they would fit in the plot.)


To learn about how the different annotations in the different datasets relate to each other we can use Ward linkage on the cell group centers to create a common dendrogram for all the groups. Because of the design of the latent space, we can simply use Euclidean distances. First we look at the broad cell classes from both the annotated datasets.

From the dendrogram it is clear that the class annotations are consistent between the datasets, as we observed in tSNE above.


We turn then to looking at all 110 annotated clusters in the frontal cortex data from the two publications. This dendrogram is on the larger side, bu the text is stil legible. While it seems pretty consistant within a given dataset, the mixing does not seem completely right. For example neurons in the Saunders et al data are mixed with oligodendrocytes from the Zeisel et l data. This is not necessarily terrible, we are only working with 158 genes from the STARmap panel, the original studies used thousands of genes to define the cell types.


The STARmap cells do not come with cell type annoations. They do however come with spatial locations for every cell! We can try to link the spatial locations in the STARmap data with the cell type annotion in the othe data sets.

scVI has specialized functionality for these tasks, both for semi-supervised learning to annotate cells, and for working with combined spatial and scRNA-seq datasets. Here however we will just do something simple to see how it works out.

Since Euclidean distances are meaningful in the latent 10-d space, we can use a kNN classifier using clusters or classes as targets.

Now we can plot the spatial locations of the STARmap cells, but annotate the cells with predicted clusters from the scRNA-seq datasets. This way we can qualitatively evaluate whether the clusters are consistent with spatial structure in the brain.


In these plots above, each "stripe" consists of different biological samples, and it wouldn't make sense to plot the cells from different biological samples in a single plane. I think they are even from different mice.

From this it seems that prediction to the annotation from Saunders et al seem more spatially organized in the STARmap data.

The cell types in the Zeisel et al data are predicted more spread out in the STARmap data. I suspect there is an issue with the cell type annotation here, since excitatory neurons (TEGLU*) are mapped to the region at the right which is supposed to be non-neuronal. I might have misunderstood how the cell clusters are indexed in the Loom files for cortex. (There is also an odd thing where the website says there are 7,000 anterior cortex cells, but the Loom file contains 15,000).

A notebook with all analysis, including fitting the scVI model, is available here.

/Valentine Svensson


A while after the invention of the microscope, scientists started applying various chemical dyes when looking at biological samples in the form of tissues from diseased organs. This revealed anatomical structures on the microscopic scale in many plant and animal tissues, birthing the field of histology. Microscopy of stained tissue is still an enormously important field, and 手机能用的梯子免费 have devised metrics for diagnosing many types of conditions and diseases based on patterns of cells in stained tissues.

The original classification of the six neuronal layers in the cerebral cortex for example was made on the basis of morphology observed using “Nissl staining” which highlights the endoplasmic reticulum.


Since the original studies of brain microanatomy in the late 19th century we have figured out that the functions and phenotypes of cells (including neurons) are determined by regulation of gene expression programs. Many histological assays are now defined by examining particular gene products examined using immunohistochemical assays.

Researchers are now making extremely rich molecular measurements with preserved spatial coordinates in tissues. At the same time as genome wide measurement are being miniaturized into technologies that preserve spatial location in tissues, technologies for counting individual mRNA molecules in cells using imaging are being increasingly 电脑上如何用梯子. We are reaching a point where these two approaches to spatial gene expression analysis are converging to similarly rich assays, with the ability to look at thousands of gene products in thousands of individual cells in tissues.

Similar to how pathologists have defined visual morphological “markers” for cell types in tissues, it is possible to look directly at gene expression or protein abundance throughout the tissue to learn about its structure.

The function of regions of tissues will in many cases be determined by sets of co-expressed genes. This means there is a coupling between spatial locality of function and particular sets of genes.

The automatic expression histology (AEH) model which we published in Nature Methods earlier this year attempts to explicitly model this coupling through a probabilistic model:

$$ \begin{align*} P(Y, \mu, Z, \sigma^2, \Sigma) &= P(Y | \mu, Z, \sigma^2) \cdot P(\mu | \Sigma) \cdot P(Z), \\ P(Y | \mu, Z, \sigma^2) &= \prod_{k=1}^K \prod_{g=1}^G \mathcal{N}(y_g | \mu_k, \sigma^2)^{(z_{g, k})}, \\ P(\mu | \Sigma) &= \prod_{k=1}^K \mathcal{N}(\mu_k | 0, \Sigma), \\ P(Z) &= \prod_{k=1}^K \prod_{g=1}^G \left( \frac{1}{K} \right)^{(z_{g, k})}. \end{align*} $$

Here \( Y \) is a set of vectors of \( G \) genes observed in \( N \) locations in the tissue. The first couple of equations just describe a probabilistic model of clusters: a Gaussian mixture model with \( K \) components. Here the matrix \( Z \) is binary, and only has one non-zero entry per row which assigns an observation to a component. The last two lines though specify that means in the GMM should be generated from smooth spatial functions, thus giving the form of a mixture of Gaussian processes. The covariance matrix \( \Sigma \) encodes spatial locality.

Modeling the spatial functions with Gaussian processes means you do not need to know what they look like, we just assume they are somewhat smooth.

In this model the expression levels of genes are governed by underlying shared spatial functions. The information we know from the experiment is \( Y \) and \( \Sigma \). We don’t know what the functions look like (\( \mu \) and \( \sigma \)) and we don’t know which genes share which function (\( Z \)).

This figure gives a visual representation of the idea of how the binary matrix \( Z \) assigns genes \( g \) to underlying patterns \( k \):

Artboard 1@4x.png

Implemented in the SpatialDE package there is a Bayesian inference algorithm for learning posterior probabilities of the assignments \( z \) between genes and hidden spatial patterns that, together, give rise to spatial co-expression. In our paper we applied it to breast cancer tissue and found a set of immune genes spatially co-expressed in a region of tumor infiltration. We also applied it to mouse olfactory bulb data and got a clear picture of the layer structure of the organ. As well as spatial organization of cell types in a small region of mouse hippocampus.

As a new example here, we can apply the method to an extremely interesting recent dataset that was produced using the novel STARmap technology. This protocol uses in situ DNA sequencing to decode barcodes bound to mRNA molecules within cells of a given tissue, integrating the concepts of CLARITY tissue clearing, in situ pc可众用的梯子, and seqFISH/MERFISH combinatorial gene tagging.

The dataset we use here as an example probed 1,020 genes in in 1,549 single cells over a region of size of about 700 µm by 400 µm in the primary visual cortex of a mouse (corresponding to Figure 4 in the STARmap paper).

First we run the SpatialDE significance test, and find that 101 of the genes significantly depend on spatial locations of the cells in the tissue. To perform AEH, we use these genes and set the characteristic length scale in the tissue. Setting these parameters is not trivial, and here we are a bit helped by domain knowledge. We happen to know that layers are on the order of a couple of 100 µm, and we expect about eight of them. We then set the length scale to be 100 µm and select eight underlying spatial patterns to assign the genes to.

After running AEH we are given assignments of the genes, as well as the underlying spatial function for a pattern, which is evaluated for all the cells in the dataset/tissue.

HPC: hippocampus, cc: corpus callosum, L1-L6: Layers 1 to 6.

HPC: hippocampus, cc: corpus callosum, L1-L6: Layers 1 to 6.

Approximately annotating the image by matching it with the figure in the STARmap paper, we see that the patterns learned by AEH match up with the classical layers of the cortex. While we used the number of layers when running AEH, no information about the structure of them were provided to the model. The layer structure that was originally observed by the morphology of stained cells was now automatically identified by the molecular measurements. One way of viewing this is that we now have access to 101 “colors” from the genes, and the “hues” we get by mixing combinations of them in the different layers form patterns that are easy to see computationally.

Compared to other spatial data this dataset is on the larger side, measuring 1500 locations, but even so AEH converges in a reasonable time. Of course the data sets will get larger, but seeing these technologies spread to more research labs is going to be very interesting. The same way many cell types have been discovered by single cell RNA sequencing, many new microanatomical structures will be discovered in many organs using these spatial gene expression methods. I hope tools such as AEH in SpatialDE will be helpful to the fields of tissue and cell biology.

The analysis of the STARmap data can be seen in a Jupyter notebook here.

Thanks to Carlos Talavera-Lópes for feedback on the post.

/Valentine Svensson


Two recent computational methods, scVI by 电脑上如何用梯子 and DCA by Eraslan & Simon et al, are the most promising methods for large scale scRNA-seq analysis in a while. This post will describe what they do and why it is exciting, and demonstrate the results of running them on three recent datasets.

In almost all cases, to do anything useful in scRNA-seq the tens of thousands of genes measured need to be summarised and simplified. An extremely effective way to summarise many variables through their covariance structure is principal component analysis (PCA). However, scRNA-seq data consists of counts, which have particular behaviours that cause issues with the interpretation and of PCA.

Potential ways of dealing with this is either figuring out how to transform count data to emulate the characteristics of continuous Gaussian data, or to reformulate PCA for the count setting. While data transformations have historically had some success, they don’t perform so well for low count numbers, which is the case when massive scRNA-seq experiments are economically feasible.

A couple of years ago Risso et al successfully created a generalized linear factor analysis model for scRNA-seq counts called ZINB-WaVE, based on a zero-inflated negative binomial (ZINB) count distribution. Omitting some further features, in this model underlying rates of observations of mRNA molecules from specific genes are modelled by a low-dimensional collection of continuous factors. Every cell has a particular hidden value for these factors, and to fit the model all cells are investigated and assigned the most likely factor values. (In these equations, red color indicates parameters that need to be inferred.)

$$ \begin{align*} Y &\sim \text{ZINB}(\mu, \phi, \pi) \\ \log \mu &= {\color{red}{W}}^{T} {\color{red}{X}} \\ \cdots \end{align*} $$

With these ZINB based methods, the data does not need to be scaled, nor normalised, and in principle the common step of selecting highly variable genes is not necessary. The data just plugs in.

Inherently, learning the \( X \) means that each cell is compared to all other cells. The factors (\( W \)) can be investigated in attempts to deduce meaning, and in that way we gain knowledge. But if you show the model a new cell \( y \), it doesn’t know what to do with it. The inference will need to be rerun with the entire dataset including this new cell.

Two new methods, called scVI (single cell variational inference) and DCA (deep count autoencoder) rethinks this model, by moving from factor analysis to an autoencoder framework using the same ZINB count distribution. (Their titles and abstracts phrase them as imputation methods, which is a bit odd and substantially undersell them!) The two methods have slightly different parameterizations, but conceptually (abusing notation a bit), this represents with they both do:

$$ \begin{align*} Y &\sim \text{ZINB}(\mu, \phi, \pi) \\ \log \mu &= f_{\mu}(X; {\color{red}{\theta_f}}) \\ \cdots \\ \hat{X} &= g(Y; {\color{red}{\theta_g}}) \end{align*} $$

A parametric function from the observed space (gene counts) to a low-dimensional space is fitted (\( g \)), at the same time as a function that maps the low dimensional space to ZINB parameters (\( f \)). For any given cell, you can apply \( g \) to get its \( X \)-representation, then apply \( f \) on that to get parameters, and evaluate the likelihood of the cell. The functions \( g \) and \( f \) are parameterised as neural networks because these are flexible and efficient.

三星手机如何和outlook.com同步邮件、联系人和日历 ...:2021-12-5 · 和hotmail的关联,华为和IOS做得最好,完美云同步邮件,联系人,日历,无论电脑端或者手机端。 小米手机也能云同步hotmail,但是手机端保存的联系人无法上传到服务器,过段时间就丢了。小米8用了10天,1550卖了,继续用MATE10。

By analysing how different regions of the \( X \)-space map to gene expression, markers and differential expression can be investigated. And on the converse, if you perform clustering in the \( X \)-space, you can take new cells and evaluate which clusters the \( g \) function maps them to.

To illustrate what the methods do, we will apply them to three recent datasets. I picked smaller datasets on the order of ~2,500 cells because I wanted to quickly run them on desktop just for testing. One from Rosenberg et al 2018, where I randomly sampled 3,000 out of 150,000 developing mouse brain cells. The second one is a taxon of Peripheral sensory neurons from, which is part of Zeisel et al 2018. And finally a dataset of male mouse germ cells that I found on GEO, but I couldn’t find an associated paper for (Lukassen 2018). I ran both DCA and scVI with default parameters: DCA produces a 32-dimensional representation, and scVI a 10-dimensional. To qualitatively inspect the results I ran tSNE on the representations, and colored the cells based on labels provided from the data sources.


The DCA method is implemented around the anndata Python package, and is very easy to run on any data you have. The scVI implementation requires you to manually wrangle your data into TensorFlow tensors with correct data types, which can be frustrating if you are not used to it. This does however imply that if you want to scale the inference using out-of-core strategies, scVI directly supports that.

In terms of run time, DCA was much faster than scVI, finishing in a few minutes. scVI took about an hour for each dataset. A large component is probably that DCA implements automatic early stopping, while scVI will run for as many epochs as you tell it, even if the fit doesn’t improve.

The scVI method has the option to account for discrete nuisance variables (batch effects), but I did not try this. And even without it, it seems to align the two mice quite well in the Lukassen 2018 data!

I am curious if there is a way to also account for continuous nuisance parameters (e.g. amplification cycles). In ZINB-WaVE this is straightforward, because it is a GLM, but it is not so clear here. I mentioned clustering in the X-space, it might be possible to formulate these models as structured autoencoders (SVAE), and encourage the \( g \) function to learn a representation that favours cell type segmentation. Notebooks where I ran the methods on the different datasets are available here.

/Valentine Svensson


The other day SpatialDE, our statistical test for spatially variable genes in spatial transcriptomics data was published in Nature Methods. The goal of the method is to find genes which in some way depend on the spatial organization of tissues, and for our paper we analysed generated with Spatial Transcriptomics, SeqFISH, and MERFISH. As a demonstration of how to use SpatialDE I also analysed some recent heart tissue. The underlying principle in SpatialDE can be hard to get to grips with, and the aim of this post is to provide an example that hopefully explains it.

One very practical reason it can be hard to see what is going on is that it is generally difficult to make plots to visualize data in more than 2 dimensions so that noise is easily interpreted. In terms of spatial gene expression, it is then easiest to make an example with a 1-dimensional tissue, and illustrate things that way.

There is actually a spatial transcriptomics dataset which is in fact 1-dimensional! In a study by Junker et al, the authors studied zebrafish embryos at the 15 somites stage by straightening them, then segmenting them into about 100 slices, each roughly 18 µm thick. For each slice digital RNA sequencing was performed providing expression levels for the entire transcriptome, annotated with the head-to-tail coordinates of the slice: a protocol they call “tomo-seq”. This provides an excellent example to illustrate what SpatialDE mean by “spatially variable genes”.


Defining spatial gene expression

Without pre-specifying the shapes of gene expression we need a way to express properties of how genes could behave across the coordinates. An effective way to look at this is to consider spatial coordinates as locally informative. If you know expression in a coordinate, then nearby coordinates should be pretty similar, while coordinates further away are still somewhat similar. In statistical terms, we say that coordinates \( x_i \) and \( x_j \) covary in a way that decreases with distance between \( x_i \) and \( x_j \). A common way to express this is through the Gaussian covariance function: \[ cov(x_i, x_j) = \exp \left(-\frac{|x_i − x_j|^2}{2 \cdot l^2}\right) \]

The parameter \( l \) describes the scale at which expression levels are locally similar. We illustrate the meaning of this by considering a length scale of 126 µm in the zebrafish embryo.


手机USB调试不了,PC识别不了手机设备,但是插上USB依然..._CSDN论坛:PC : ubuntu 10.04 手机: 4.11 原生系统现在就是测试不了 , lsusb 也看...没用过linux,不过有一次我把手机的USB存储打开了,就连不上了不能调试了。但是...

\[ \begin{align*} y_g &\sim \mathcal{N}(0, \Sigma), \\ \Sigma_{i, j} &= cov(x_i, x_j) \end{align*} \]

Models like this where the covariance between points are parameterized are known as Gaussian processes.

Spatial and non-spatial variance

The curves above are random draws, but have you ever seen data that looks so clean? More likely there is additional noise which we cannot explain by only considering the spatial covariance between zebrafish slices. This can be expressed as \[ y_g \sim \mathcal{N}(0, \sigma^2 \cdot (\Sigma + \delta \cdot I)) \]

The \( \sigma^2 \) parameter is the overall magnitude of the variance of \( y_g \), while the \( \delta \) parameter scales how much of the variance not determined by spatial covariance. With these quantities (together with a correction for the structure of \( \Sigma \) called the Gower factor \( G(\Sigma) \)) we can move to a measure we refer to as “fraction spatial variance” (FSV), FSV \( = G(\Sigma) / (G(\Sigma) + \delta) \). This is a number between 0 and 1 which indicate how much of the variance in data is due to spatial covariance. To illustrate, we can randomly simulate data with different levels of FSV.


The simulated data in the panels have the same variance. But we can see that the spatial contribution to the variance decreases as the FSV decreases.

The SpatialDE test fits the \( \sigma^2 \) and \( \delta \) parameters to each genes observed expression levels, and also compares the likelihood with a model that has no spatial variance (FSV = 0) to obtain a significance level (p-value). This way it is possible to distinguish between genes with high variance and genes with high spatial variance. The \( l \) hyper-parameter for the \( \Sigma \) matrix is also fitted through grid search, so for each gene we also get the optimal length scale. (This dataset had 95 samples of 20,317 genes, and the SpatialDE test took about 5 minutes to run).

As an illustration, let us look at the expression patterns of a number of genes identified with different FSVs in the zebrafish embryo.


For the higher FSVs it is clear the expression depends on spatial location, while this is less clear for small FSVs. We can look at the relation between total variance and spatial variance, and how the FSVs are distributed compared to these.


On the right side the standard plot from SpatialDE illustrates the relation between the FSV and the statistical significance for all genes. I put in references to the genes I used as examples, as well as some genes that were highlighted by Junker et al in the original paper. In total SpatialDE identifies 1,010 genes as significantly spatially variable, with most of those genes having length scales between 65 µm and 244 µm.

Hopefully this one-dimensional example helps clear up what we mean by “spatially variable genes”, and also illustrate that there are many forms of data that can use spatial analysis. If you have spatial gene expression data, I recommend trying SpatialDE to identify genes of interest. The analysis associated with this post is available in a 电脑端能用的梯子. If you want to use SpatialDE for your spatial data, you can install it by 电脑用什么梯子. Thanks to Lynn Yi for editorial feedback on this post.

/Valentine Svensson

Actionable scRNA-seq clusters

Recently 手机梯子怎么给电脑用 published a blog post about "real" scRNA-seq clusters. The general idea is that if there is not differential expression between clusters, they should be merged. This is a good idea, and putting criteria like this highlights expectations of what we mean by clusters, and may in the future direct explicit clustering models that incorporate these.

The workflow Jean presents is similar to how I have been looking at these things recently, and the post inspired me to 1) write down my typical workflow for cell types or clusters, and 2) put my code in Python modules rather than copy-pasting it between Notebooks whenever there's new data to look at. Of course in this post I'll be making commands and images much more stylized than they would typically be, but the concept is representative.

Simulated data

To assess whether an approach is reasonable, it is good to make some simulations of the ideal case. To simulate cells from different cell types, I make use of 1) a theory that cell types are defined by "pathways" of genes, and 2) observations from interpreting principal component analysis from many datasets. These indicate that to produce a transcriptional cell type, a small number of genes are upregulated and covary among each other.

With this in mind, simulation is done by, for each cell type, creating a multivarate normal distribution for each cell type, which have increased mean and covariance for a defining "module" of genes. From this distribution a number of cells are sampled, and expression values are pushed through a softplus function to immitate the nonnegative scale and properties of log(counts + 1). The process is repeated for each cell type, leaving a number of genes as inactive background.

In this particular case I simulate 10 cluster spread over 1,000 cells, with 20 active marker gene per cluster, and finally add 300 "unused" genes.

from NaiveDE import simulate data = simulate.simulate_cell_types(num_clusters=10, num_cells=1000, num_markers_per_cluster=20, num_bg_genes=300, marker_expr=1.8, marker_covar=0.6)


Here a tSNE plot is used as a handy way to look at all the cells at once. The most unrealistic part of this simulation is probably the uniform distribution of cell numbers per cell type. In real data this is very uncommon.

To cluster cells it is pretty effective to work in the space of a number or principal components. I like to use Bayesian Gaussian mixture models to group cells into clusters in this space. First I will cluster with an overly large number of clusters.

from sklearn.decomposition import PCA pca = PCA(15) Y = pca.fit_transform(data) from sklearn.mixture import BayesianGaussianMixture def get_clusters(K): gmm = BayesianGaussianMixture(K, max_iter=1000) c = gmm.predict(Y) return c c = get_clusters(25)


In this tSNE plot each color correspond to a cluster. As Jean points out, for a cluster to be useful in followup experiments we must be able to define it with a small number of genes. That is, there should be some genes which will allow us to quickq手机版:ios好用的加速器软件下载手机电脑都能用的梯子安卓怎么下油管app shadowrocket ssr订阅国内怎么能用国外的软件安卓梯子挂俄罗斯自由浏览器有ios版吗谁有坚果vip账号.... For many this is a bit reductionist, and t is not impossible functional cell types are defined by hugely complex nonlinear interactions of hundreds of genes. But in practice, we wouldn't know what to do with such cell types.

The definition of an actionable cell type being one we can predict leads to predictive models. In particular regularised logistic regression is good for this. By controlling the regularisation) so that each cluster has a "marker budget" of only a handful genes, we can ensure that a few markers can predict the types. It is very rare to look at more than ~5 top genes when assessing clusters in practice.

ROC curves from the predicted assignment probabilites in logistic regression is a practical way to assess whether we are able to predict the cell types correctly. Here I split the data into 50% training and testing, train logistic regression, then create the ROC curve for the test data, for each cell type.

The printed numbers in the training command are the number of positive markers for each cell type. I interactively change the sparsity parameter to keep these numbers generally low. In this regard this is all quite supervised and subjective, and far from automated.

from NaiveDE import cell_types test_prob, test_truth, lr_res, lr = cell_types.logistic_model(data, c, sparsity=0.5) [26 24 34 31 8 31 10 10 42 24 17 8 10 39 17 39 5 7 12 7 21 19 10 9 14] cell_types.plot_roc(test_prob, test_truth, lr)


The colors of the curves here correspond to clusters. A number of the clusters are relatively close to the unit line, indicating that we have a hard time predicting these, and they will not end up being actionable. So we decrease the K and try again. We iterate this procedure until we are happy with the predictability of the clusters.


Here I have also plotted cases of "under clustering" the data. The larger clusters are still pretty predictice, but we would want to maximize the number of clusters which would be experimentally actionable.

In the case of K=10 (which is what we simulated), we can also look at which genes the predictive model is using for each cell type. We can use the get_top_markers command to extract the N largest weights for each class in a handy table.

top_markers = cell_types.get_top_markers(lr_res, 5)

To visualise how these weight relate the expression of cells with different cluster annotations, we can plot a "marker map", which sorts cells by cluster, and plots the top marker genes in corresponding order on the Y axis. This is a very common plot in scRNA-seq cluster studies.

www377116con + 重庆市南川市松诚辉梯子有限公司:2021-6-18 · 先进人员 更多 快讯:两市宽幅震荡沪指跌0.33% 工业大麻午后走强 2021-06-18 法国政府:希望在4月10日前能有明确的英脱欧延期计划 2021-06-18 上投摩根杜习杰:税延政策是个人养老投资重要驱动力 2021-06-18 欧盟据称针对英国退欧期限延长问题更 ...


We see that structure we simulated is largely recovered! The diagonal blocks indicate genes which predict the cell types.


Now we try this strategy on real data. In particular, we are using one of the batches of bone marrow data from the recently published Mouse Cell Atlas. This batch has 5,189 cells and expression values for 16,827 genes. For the sake of speeding up the analysis a little I randomly sample 3,000 of the cells.


Now we perform the same procedure of training GMM's and attempting to predict held out data with logistic regression.


At 7 clusters I stop, here the clusters are very easy predict. Again we can create the marker map.


Obviosuly this is a lot noisier than the simulated data. Another thing we notice is that the number of cells per cluster is much less even than for the simulated data. This will cause some issues with interpreting the ROC curves, but in practice we want to try to have some minimal size for clusters in order to keep them reliable.

In order to quickly interpret the clusters, and read out what the predictive genes are, the plot_marker_table command will lay out predictive weights and names of top marker genes for each cluster, the colors relate to the colors in the plots.

喜迎十九大,梯子不用请横着放(附蓝灯的安装) | web安全 ...:2021-10-12 · 作为一名学习渗透的小白,没有google实在表示很难受啊。因为从谷歌上找的教程众及国外技术大佬真的是666啊,还有找个什么网站测试的实在是太舒服了。 说起翻墙用的梯子,那还是三年前的记忆吧。当时想上油管看视频(油管就是youtube,具体为什么叫油管,你伔可众百度),用的还不是插件 …


I find this workflow fairly straightforward and quick to work around. There are some clear drawbacks of course: tt is quite manual, and we are not quantifying the uncertainty of these predictive weights, so we can't do proper statistics.

Notebooks for this post are available here.

/Valentine Svensson

The effect of Poisson zeros on OLS regression results

In a previous post I wrote about the Poisson distribution seeming like a good error model for scRNA-seq counts. This suggests using GLM with Poisson likelihood to analyse your data, as long as the offset due to count depth variation is taken into consideration.

An alternative strategy could be to transform the counts to roughly normal, and perform analysis in that setting. This is effectively what the vast majority of studies do for unsupervised analysis: counts are transformed, then PCA is used to find a low-dimensional representation for further analysis such as clustering.

What if we try to adjust for the count depth variation in a supervised setting assuming Gaussian noise?

A huge benefit of assuming Gaussian noise is that linear regression has an extremely efficient solution, usually referred to as OLS regression. A couple of years ago I made a simple Python package NaiveDE to perform OLS regression on gene expression matrices. I don't recommend anyone use it for final analysis, indeed I called it "Naive DE" because it is a baseline. Literally every other DE test will be better than it by design, in particular with regards to false positive P-values. (Well maybe not according to a recent study, the test in NaiveDE should be equivalent to the t-test.) It is nevertheless convenient during exploratory analysis to iterate through models

Alternative and null models are specified by Patsy formulas, and significance is calculated with a likelihood ratio test. A Bonferroni corrected version of the P-value is also reported.

For every gene \( g \) where we have a design matrix \( X \) and observed counts \( y_g \) we look at

$$ y_g = \mathcal{N}\left( \alpha^T_g X, \sigma^2_g \right). $$

The weights \( \alpha \) are calculated by OLS, and \( \sigma^2 \) is reflected in the residual errors. For flexibility, intercept is optionally part of the design matrix.

Negative control data

In the negative control 10X dataset from Svensson et al 2017, the only variation in observed expression should (in theory) be due technical effects, in particular the count depth variation. Here we are using 2,000 cells with 24,000 genes. The most common variance stabilizing transformation of scRNA-seq data is \( \log(Y + 1) \), so we will investigate how this affects regression.

If the gene counts are scaled per cells, we would want

$$ \log\left( \frac{y_g}{\text{counts}} \right) = \log(y_g) - 1.0 \cdot \log(\text{counts}) = \mathcal{N}(0, \sigma^2) $$

We set up a model where the design matrix \( X \) have the log total counts, and an intercept. Ideally the weights for the log counts should be found to be 1, and the intercept 0. Note that in practice we are always using \( \log(y_g + 1) \).

%%time lr_results = NaiveDE.lr_tests(sample_info, np.log1p(counts.T), alt_model='~ np.log(total_count) + 1', null_model='~ 1')

CPU times: user 2.98 s, sys: 702 ms, total: 3.68 s
Wall time: 3.35 s

The test produces a table with weights from the alternative model, hypothesis test restults.

print(lr_results.sort_values('np.log(total_count)', ascending=False).head(25))

                Intercept  np.log(total_count)           pval           qval
ENSG00000198938  -5.501025             1.080920  3.073563e-294  6.147125e-291
ENSG00000198727  -5.600405             1.041435  3.073563e-294  6.147125e-291
ERCC-00002       -2.999032             1.034389  3.073563e-294  6.147125e-291
ERCC-00136       -4.155633             1.017020  3.073563e-294  6.147125e-291
ERCC-00113       -4.297474             1.010625  3.073563e-294  6.147125e-291
ENSG00000198886  -5.615521             1.010178  2.134557e-266  4.269114e-263
ENSG00000198712  -5.144020             1.005586  3.341643e-168  6.683285e-165
ERCC-00096       -2.740023             0.989442  3.073563e-294  6.147125e-291
ENSG00000210082  -4.357098             0.988333  3.073563e-294  6.147125e-291
ERCC-00046       -3.727992             0.979269  3.073563e-294  6.147125e-291


In this plot np.log(total_count) does not refer to the value, but weight for this variable. Each dot is a gene rather than droplet. The P-value comes from comparing the model with one that does not consider the depth.

使用方法-几款实用的梯子介绍 - 发现好游戏 - 酷酷跑嗨圈 ...:2021-5-8 · 使用方法-几款实用的梯子介绍, 先给大家介绍一下VPN软件如何使用:一般安卓应用商店都有VPN软件,自己搜索下载就好!关于如何使用的话,那么久众我经常使用的“急速安全 VPN”为例,跟大家介绍一下如何使用:1、启动软件后界面是这样的,每次启动都会弹出这个提醒,想要免费使用的话点击 ...


We can look at a few examples of genes with different count depth weights.


From this, it is clear that increased observations on the low count values, in particular 0, are responsible for decrease in the total count weight.


Now let us investigate how this count depth effect plays in to a differential expression analysis. With all published large scale experiments cataloging cell types, it is getting increasingly easy to simply fetch some data and do quick comparisons. We will use data from the recent single cell Mouse Cell Atlas30116-8). To get something easy to compare, we use the samples called "Brain" and focus on the cells annotated as "Microglia" and "Astrocyte". Out of the ~400,000 cells in the study, these two cell types have 338 and 199 representative cells. On average they have about 700 total UMI counts each, so while the entire study is a pretty large scale, the individual cell types and cells are on a relatively small scale. The final table has 537 cells and 21,979 genes.


                            ClusterID Tissue    Batch        Cell Barcode  \
Cell name                                                                  

                                    cell_type super_cell_type  is_astrocyte  \
Cell name                                                                       
Brain_1.AAAACGCGAGTAGAATTA  Astrocyte_Mfe8 high       Astrocyte          True   
Brain_1.AAAACGGAGGAGATTTGC  Astrocyte_Mfe8 high       Astrocyte          True   
Brain_1.AAAACGGGCTGCGACACT            Microglia       Microglia         False   
Brain_1.AAAACGGTGGTAGCTCAA  Astrocyte_Mfe8 high       Astrocyte          True   
Brain_1.AAAACGGTTGCCATACAG  Astrocyte_Mfe8 high       Astrocyte          True   

                            total_count  gene  
Cell name                                      
Brain_1.AAAACGCGAGTAGAATTA       1088.0     0  
Brain_1.AAAACGGAGGAGATTTGC        967.0     0  
Brain_1.AAAACGGGCTGCGACACT        543.0     0  
Brain_1.AAAACGGTGGTAGCTCAA        679.0     0  
Brain_1.AAAACGGTTGCCATACAG        957.0     0

In a differential expression test you simply include a covariate in the design matrix that informs the linear model about the different conditions you want to compare. Here we are comparing microglia and astrocytes.

%%time lr_results = NaiveDE.lr_tests(sub_samples, np.log1p(sub_counts.T), alt_model='C(is_astrocyte) + np.log(total_count) + 1', null_model='np.log(total_count) + 1')

CPU times: user 705 ms, sys: 136 ms, total: 841 ms
Wall time: 707 ms


         Intercept  C(is_astrocyte)[T.True]  np.log(total_count)  \
Atp1a2   -1.925596                 1.840452             0.318532   
Sparcl1  -1.008002                 1.742278             0.179123   
Tmsb4x   -3.680027                -2.044908             0.948016   
Hexb     -2.165802                -2.032087             0.646263   
Ctss     -1.665139                -1.937761             0.553429   

                pval           qval  
Atp1a2   3.058918e-162  1.642639e-159  
Sparcl1  3.548817e-158  1.905715e-155  
Tmsb4x   2.742131e-153  1.472524e-150  
Hexb     3.671724e-145  1.971716e-142  
Ctss     8.167943e-144  4.386185e-141 


Also in this case we can see that the count depth weights are deflated for lowly abundant genes.


Similar to above, we can look at the relation between count depth and observed counts for a few genes, but we can also make sure to plot the stratifiction into the two cell types and how the regression models are predicting the counts.


Again we can see the overall abundance is related to the slope of the lines. Another thing which seem to pop out in these plots is an interaction between cell type and slope. For example looking at C1qa the slope for the microglia seem underestimated. This makes sense, if this is an effect of count noise at low abundances.

My takeaway from this is that OLS regression might be OK if counts are large, but at lower levels model parameters are not estimated correctly due to the count nature of the data.

Notebooks of the analysis in this post are available here.

/Valentine Svensson

Count depth variation makes Poisson scRNA-seq data Negative Binomial

In the scRNA-seq community the observation of more zero values than expected (called the "dropout problem") is still a concern. The source seems to be an intuition that at such small scales of biological material as RNA from individual cells, molecular reactions lose efficiency due to conceptual stochastic events. The trendiest computational research directions in the field at the moment are probably tied between "how do we do this for a million cells?" and "how do we deal with the dropouts?". In particular droplet based scRNA-seq methods are considered to have more dropouts, often leading investigators that opt for more expensive plate based methods even for exploratory pilot experiments.

In negative control data there is no evidence for zero inflation on top of negative binomial noise, counter to what is commonly suggested (in particular for droplet based methods). A notion that has inspired significant pc可众用的梯子 efforts. A recent interesting report by Wagner, Yan, & Yanai goes even further and illustrates that the Poisson distribution is sufficient to represent technical noise in scRNA-seq data. The authors write that additional variation in gene counts is due to efficiency noise (an observatin from Grün, Kester, & van Oudenaarden that different tubes of reagents appear to have different success rates), and can be accounted for by an averaging approach.

This can be explored by simulating data! Say droplets contain transcripts from 300 genes, whose relative abundance levels are fixed because they come from the same RNA solution. Then a droplet with \( d \) transcripts can be seen as a draw from a multinomial distribution,

\[ c_i \sim \text{Multinom} (d, (p_1, \ldots, p_{300})). \]

Now each gene will independently conform to a Poisson distribution.


The constant mean-variance relation for Poisson holds (as expected) for this simulation. In actual data, genes with higher abundance are over dispersed, which can be modeled using a negative binomial distribution.

The negative binomial distribution is constructed as a mixture of Poisson distributions, where the rate parameter follows a Gamma distribution. Other Poisson mixtures have also been suggested for scRNA-seq data.

An aspect of real data which our mutlinomial simulation does not account for is that the total counts observed in each droplet is variable. Indeed, usually a cutoff at some low number of total counts per droplet is used to decide which droplets captured cells and which only contain background material that is not of interest.


Thinking about the abundance levels of the different genes as pc端如何上外网 in Poisson distributions require each observation to come from a constant count depth. If the count depth varies in each observation but the model is not informed of this, it will appear as if the rate for each gene is variable, and this will be more consistent with a negative binomial distribution.

As an illustration, in the simulation, variation in count depths can be included. For simplicity, a uniform distribution is used,

梯子游戏输了把我的前程都毁了,有什么技巧和方法能赢回来 ...:2021-9-9 · 今年世界杯开始接触了网赌!开始是玩玩球赛,其间输了两W左右,后来就不玩了!但是由于无聊,又重新玩!偶然间发现了软件里面有一个电子游戏的!就好奇进去玩!开始几K几K的赢,反正赌球输的钱都能在老虎机那赢回来!


These new values clearly have the quadratic polynomial mean-variance relation that is typical for scRNA-seq counts.

This indicates we need to handle the differences in count depth. The easiest solution is to simply divide the expression counts in each cell with the total depth, turning each expression value into a fraction.

In the RNA-seq field it is also common to also multiply these fractions by 1 million to form the "CPM" unit.


It is clear that after creating either fractions or CPM will follow a linear relation between mean and variance. However, in both cases there is an offset from the unit relation, and in particular for the CPM unit the variance gets inflated compared to the mean.

The thrid panel shows the result after manually scaling the fractions (through multiplication by 3.5e4) to achieve the Poisson mean = variance relation. (There is probably a closed form expression for the scaling factor that achieves this, and the 1e6 is above this, explaining the variance inflation.)

It is entirely possible that the this type of scaling to create CPM from fractions is one reason people have noticed higher than expected numbers of zeros. For Poisson data, the expected number of zeros at a given mean expression level is given by the function \( e^{-\mu} \).


The counts themselves follow the theoretical curve quite close, but with an increase of zeros at high expression levels, consistent with negative binomial zeros. 'Fractions' see a large offset of much fewer zeros than is expected given the mean, while CPM see an offset for more zeros than expected. The manually scaled values follow the theoretical curve decently, though far from exactly.

For interpretable analysis, counts should be scaled for total count depth, but this also need to be taken under consideration when looking at the results (e.g. dropout rate). The best solution might be to take inspiration from the field of generalised linear models. In that field offsets are included in models when there is a clear explanation for variation in counts, to convert counts to pc端电脑梯子. Clustering or pseudotime methods could be reformulated to the Poisson setting with offsets.

There are some additional aspects to keep in mind. For negative control data where each droplet contains RNA from the same solution, the count depth variability must be technical, but in real samples this could also be due to cells having variable amounts of RNA. For droplet based data one simple reason for the heterogeneity could be due to variation in coverage for DNA oligos on barcoded beads. It is not clear what an explanation for plate based methods would be, and no proper negative control data exist for plate based methods to investigate these properties. On a similar note, the latest single cell sequencing methods based on stochastic schemes for in situ barcoding of cells are impossible to assess with negative control samples.

An R notebook for this analysis is available here. Thanks to Lior Pachter for editorial feedback on this post.

/Valentine Svensson

谁有能用的梯子我更新一下_Moto Z 系列-联想社区:2021-1-8 · 谁有能用的梯子我更新一下 产品 服务与支持 手机商城 社区 ZUI 登录 | 注册 首页 > 版块 > Moto Z 系列 > 谁有能用的梯子我更新一下 ... 线路多,用起来简单,一键连接,适用于电脑端和手机端,可众多人一起用。已经用了2 ...

Some sequencing centers have moved to only work on BAM or CRAM files rather than "raw" FASTQ files. The motivation for this is that CRAM files can be heavily compressed and require less memory for the sequencing service provider. The CRAM files in particular make use of the alignment to a reference genome to achieve better compression performance, especially when the files are sorted by alignment coordinate.

In many forms of sequencing analysis, in particular genetics, coordinate sorted alignments to a standard reference is so standardized it is considered a "raw data" format. Unfortunately, this is not the case for RNA sequencing, in single cells or other. As demonstrated by the previous posts in the series, we are often not even sure what material we have been sequencing. The choice of what cDNA's are in the reference is also an informed choice depending on the experiment. For these reasons, software for working with RNA-seq data require FASTQ's as input.

To convert BAM's/CRAM's to FASTQ, a handy command from the samtools suite is samtools fastq. It is however not mentioned in the documentation or in the help for the command itself that is assumes name sorted BAM/CRAM input. It will not even stop or warn you if your file is coordinate sorted. Instead, it will silently create paired FASTQ files with incorrect read orders.

This issue was recently raised by Davis on twitter.

In particular, the incorrectly generated FASTQ files will have worse performance in terms of mapping statistics, caused by read pairs not originating from the same cDNA fragment. We can compare the result on the same data and reference as we used before, after converting files with name sorted and coordinated sorted CRAM files.


We see that cells which would have a mapping rate of > 75% only have ~40% mapping rate with the incorrect coordinate sorting.

Thomas Keane recommends using the samtools collate command before converting to FASTQ to quickly ensure reads are correctly ordered.


$ samtools view -H 20003_4#70.cram | head | cut -c1-50
@HD    VN:1.4    SO:coordinate
@RG    ID:20003_4#70    DT:2016-06-07T00:00:00+0100    PU:1
@PG    ID:SCS    VN:2.2.68    PN:HiSeq Control Software    DS:
@PG    ID:basecalling    PP:SCS    VN:    PN:RTA    DS:B
@PG    ID:Illumina2bam    PP:basecalling    VN:V1.19    CL:uk.
@PG    ID:bamadapterfind    PP:Illumina2bam    VN:2.0.44    CL
@PG    ID:BamIndexDecoder    PP:bamadapterfind    VN:V1.19
@PG    ID:spf    PP:BamIndexDecoder    VN:v10.26-dirty    CL:/
@PG    ID:bwa    PP:spf    VN:0.5.10-tpx    PN:bwa
@PG    ID:BamMerger    PP:bwa    VN:V1.19

Similar to the case of unsorted FASTQ files, you can also notice this by read headers not matching up in the FASTQ pairs.

$ head 20003_4#70_*.fastq | cut -c1-50
==> 20003_4#70_1.fastq <==

==> 20003_4#70_2.fastq <==

This post was suggested by Raghd Rostom and Davis McCarthy.

/Valentine Svensson


As scRNA-seq (singel cell RNA sequencing) started to gain popularity users expressed concern about an unexpected number of zero values among gene expressions. That is, for any given gene many cells had not detected the expression, even if it was relatively high in other cells.

It is unclear when this was originally stated, but it has been named the "dropout" problem. A search on Google Scholar will give hundreds of publications discussing the problem of "dropouts" in scRNA-seq, and there are several methods papers explicitly aimed at investigating and dealing with the "dropouts". Typically by imputing zeros to positive values or by stating models which includes a zero-inflation component. These observed zeros ("dropouts") in the data have typically been explained by inefficiencies of molecular reactions, due to the very small volumes of mRNA in individual cells.

In high throughput variants of scRNA-seq assays cells are isolated in (reverse) droplets, within which several molecular reactions occur to eventually give rise to labeled cDNA from expressed genes from each cell. Part of what makes this possible is limiting the sequenced fragments to just single tags from the 3' or 5' end of each transcript. It has recently been observed in statistical analysis that RNA tag counting versions of scRNA-seq data is better explained without additional zero inflation.

Nevertheless, it is common to hear weariness from potential usesrs of droplet based scRNA-seq assays because they are percieved to have a higher amount of "dropouts" than alternative more expensive and manual methods with less throughput.

These observed zeros are consistent with count statistics, and droplet scRNA-seq protocols are not producing higher numbers of "dropouts" than expected because of technical artifacts.

To see this, consider four experiments were solutions of RNA were evenly distributed into droplets, guaranteeing a complete lack of biological variation. One was performed with inDrop (Klein et al), one with 10X Genomics GemCode (Zheng et al), and two with 10X Genomics Chromium (Svensson et al. All datasets have on the order of ~1,000 droplets with RNA, facilitating accurate estimation of e.g. mean or variance for each gene.

It has been observed that expression counts from these technologies follow the negative binomial distribution, in which there is a quadratic mean-variance relationship.


Compared with experiments involving single cells the mean-variance relation is extremely clear in these homogenous datasets.

使用方法-几款实用的梯子介绍 - 发现好游戏 - 酷酷跑嗨圈 ...:2021-5-8 · 使用方法-几款实用的梯子介绍, 先给大家介绍一下VPN软件如何使用:一般安卓应用商店都有VPN软件,自己搜索下载就好!关于如何使用的话,那么久众我经常使用的“急速安全 VPN”为例,跟大家介绍一下如何使用:1、启动软件后界面是这样的,每次启动都会弹出这个提醒,想要免费使用的话点击 ...

\[ P( k \,|\, \mu, \phi) = {k + \phi^{-1} - 1 \choose k} \left( \frac{\mu}{\mu + \phi^{-1}} \right)^k \left( \frac{\phi^{-1}}{\mu + \phi^{-1}} \right)^{\phi^{-1}}. \]

梯子登山者下载 梯子登山者电脑版下载:2021年6月24日 - 梯子登山者是一款休闲动作游戏,您可众通过简单的触摸...一、下载游戏与工具时请务必使用迅雷7,在下载前建议...我的世界神奇宝贝1.15.1手机版 电脑版 我的世界...

\[ P( k=0 \,|\, \mu, \phi) = \left( \frac{\phi^{-1}}{\mu + \phi^{-1}} \right)^{\phi^{-1}}. \]

With this function we can visualize theoretical "dropout" rates for various means and dispersion values.


These values can be compared with the empirical "dropout" rate, simply calculated as

\[ \frac{\text{# cells with gene = 0}}{\text{# cells total}} \]

for each gene.


The "dropout" rates for the data without biological variation follows the theoretical prediction. In all datasets the Pearson correlation between theoretical and empirical dropout rates is 99.9%.

Here the \( \phi \) parameter is different for each dataset, and it is possible that this overdispersion is affected by technical factors. There does however not seem to be any technical contribution to zero-inflation, if it is observed it is instead more likely caused by biological heterogeneity.

/Valentine Svensson


Quantitative sequencing assays in general yield counts. The generative models for different levels of counts are in many ways fundamentally different from continuous distributions such as the more common Gaussian (normal) distribution. The problem is not that the data consists of integers; rounded normal data such as e.g. user ratings of products wouldn't have any particular problems being analysed with normal methods. Counts however are generated by several, cumulative, singular events. With each of the events having some probability of occurring in a given "time interval" or other relevant unit.

For example, if we think about the RNA sequencing process in the abstract we have a large collection of cDNA molecules which are randomly sampled and identified as originating from genes. How often moleculas are identifed from any particular gene tells us something about the abundance or expression level of the gene. The process of counting however implies that variation will propagate as the number of events increase. The effect of this is that there will be an inherent relation between mean (expected value) and variance of counts.

As an example, let us look at two recent datasets. One from Karaiskos et al where the authors mapped out fruit fly development on the single cell level, and another from Stoeckius et al where the authors developed a new method to study RNA and protein expression from cells in tandem.


The mean-variance relation typical for negative binomial distributed count data is quite clear. Negative binomial, or other similar distributions, have been used to study RNA-seq data for a long time. Almost all statistical tests for comparing control vs condition style experiments (differential expression) use generalized linear models assuming count data with these kinds of distributions.

Single cell RNA-seq data is different. Not necessarily because the data wouldn't be suitible for these tools, but rather because differential expression is a minor question of limited interest in single cell studies. By far the most popular use of scRNA-sequencing is to identify groups of cells which are similar to eachother and might correspond to functionally distinct cell types. In addition to such clustering analysis, inference of developmental trajectories are popular, as well as quantifying the degree of variation between conditions.

So unlike bulk RNA-sequencing, the key analysis modality is in terms of multivariate analysis such as clustering or "dimensionality reduction" like PCA. In the coming years I believe figuring out effective ways to think about these issues for count data will be important, especially for sparse counts from low depth!

In the meantime, it is useful to be able to use available methods. Existing methods for clustering or dimensionality reduction are almost always either explicitly or implicitly designed with normal data in mind. (A notable exception being ZINB-WaVE by Risso et al). Any method using Euclidean distances implies normally distributed data.

One way to deal with these problems is to pc可众用的梯子) in some way which makes it more similar to normal data. The logarithm is a very practical transformation for positive data. In particular ratios are very useful to log transform. With counts though, it is common to observe values of zero, for which the logarithm is not defined. Instead, it is common to perform \( \log(x + 1) \) for counts \( x \).

If we apply this transformation to each gene \( i \), we can investigate a couple of things. First of all, we can see what happens with the mean-variance relation. Secondly, we can display the first few principal components for both of the data sets using the transformed unit. This will represent a form of multivariate analysis.


From the plots we can observe a few things. For higher mean counts the relation between mean and variance (or standard deviation) is gone. However, for lower counts (mean \( < 1 \)) there is still some correlation. From the PCA below we can see some subpopulations for each data set.

只有手机,没有电脑。怎么防止别人蹭自己的网呢?_百度知道:最佳答案: 大部分路由器品牌如水星、TP、迅捷、斐讯都有手机app可供下载,在app里面可众控制管理联网设备,这样就可众防止别人蹭网了。更多关于手机电脑都能用的梯子的问题>>

There is actually some theory we can use here. Our goal was to transform the data in way that removes the mean-variance relation as effectively as possible. This is known as a variance-stabilizing transformation. For example, in bulk RNA-sequencing the DESeq2 package has a function pc端如何上外网 for this (based on the underlying parametric Poisson-Gamma model).

If there is a functional form for the relation between the mean and the variance, e.g.

\[ Var(x) = g(\mu), \]

then the variance can be stabilized by applying the function

\[ f(x) = \int^x \frac{1}{\sqrt{g(v)}}dv. \]

As illustrated in the first plot of the post, for negative binomial data, which generally suits scRNA-seq counts well, we have that

\[ Var(x) = \mu + \phi \cdot \mu^2, \]

where \( \phi \) is the dispersion for the data. If we plug this into the integral above, and use Wolfram Alpha to solve the integral because I'm not in school anymore, we get

\[ VST(x, \phi) = 2 \cdot \frac{\sinh^{-1} \left( \sqrt{ \phi \cdot x } \right)}{\sqrt{\phi}} \]

It is very easy to find a \( \phi \) for the data by fitting a polynomial to the observed mean-variance relation. Let's transform our data in this way, and redo the plots for the two data sets.


We can note that this transformation behaves very similarly to \( \log(x + 1) \). One difference is that the standard deviation is scaled around 1 rather than around 0.5. Any effect on the PCA seems minimal though.

In a 1948 paper Anscombe explored these sorts of tranformations for Poisson and negative binomial data. In addition to the \( \sinh^{-1} \) form of the solution to the integral, Anscombe also considers an approximation which works for certain ranges of mean and \( \phi \). The approximation has the form \[ \log\left( x_i + \frac{1}{2 \cdot \phi} \right). \]

This has the same form as the heuristic \( \log \) transform, but instead of just picking 1 the "pseudocount" is motivated by the data distribution and statistical theory. We also create the same plots as above for data tansformed in this way.


Again the data looks similar to before, and we're back to the situation of having standard deviation around 0.5 for highly expressed genes.

In both of these cases we have assumed a global \( \phi \) parameter, and found it by polynomial curve fitting of mean vs variance. This is handy because we can use information from all genes and learn something global about the data. Then each gene is transformed assuming a fixed dispersion level.

For these data sets, there's actually no real need to assume a global \( \phi \). They have thousands of cells providing observations for each gene, we can easily learn individual \( \phi_i \) for each gene by maximum likelihood. That means we can perform VST of each gene independent of the other genes.

(It should be mentioned that finding all the \( \phi_i \)'s took about 40 minutes for the larger of the data sets, so it's not extremely practical).


When plotting these independent VST values we first see that the relation between mean and standard deviation is much "tighter" for both data sets. We still get an interesting bump for lower expression values, but after a mean of 2.0 the standard deviations are stable at 1.0. (A problem of course is that it's hard to know what "2.0 expression" means here, but it seems somewhat comparable between the two data sets).

Here we notice that the low-dimensional representation in the PCA is different from the previous data transformations. For the left data, we don't see clear clusters anymore, while for the right data some within-cluster covariance seem to follow PC1 better.

Finally, we can try to perform the approximate Anscombe transformation for gene specific \( \phi_i \) values.


It's hard to say what we are seeing here. There's definitaly no apparant correlation between mean and variance after the transformation. Though the standard deviations are not particularly stable around a value. For very high mean expression, values are transformed to have 0 standard deviation, meaning they are probably transformed to a constant value.

The multivariate PC analysis shows similar results as the first few transformations.

3款寻宝游戏,人多人少都能玩,出发!-幼师宝典官网:2021-1-21 · 7.爸爸妈妈每天用来操作电脑的物品。 8.爬上房间里的梯子去找找看吧(宝宝都喜欢爬高上低的嘛,这是爬高训练) 9.妈妈的钢琴,并正确念出妈妈给出的字母。(如果孩子会弹钢琴,可众让Ta弹一首近期练习的曲子,这里的环节可众变通成孩子近期正在学习的

One note though, is that in all cases (except the last) there still is a dependency between mean and variance for genes with very low means. Considering that scRNA-seq as a field is moving towards more cells rather than more counts per cell, this might mean that variance stabilizing transforms are the wrong way to go in modern studies. Instead working directly with count distributions might be a more stable strategy for low counts. There of course is very limited prior work on this, and that is good to keep in mind when working with and planning to make shallow scRNA-seq data.

/Valentine Svensson

Low mapping rate 5 - Human DNA contamination

This is (most likely) the final post in the series investigating the low mapping rate of our Smart-seq2 data from our study on the malaria immune response. If you have read the previous posts, you might have notied a population of cells which have been stuck at extremely low mapping rate, no matter how much things improved for the other cells. It turns out this population of cells are contaminated with human material.

In our study we invsetigated CD4+ T cells sorted from spleens of infected mice. There are a number of potential entry points for human material to contaminate the samples.

  1. Human cells can be sorted with mouse cells.
  2. Human material enter the plate of cell lysate during cDNA generation.
  3. Human material can end up in the plate when creating the DNA sequencing library.

最近是不是墙都翻不了啊 - 点滴生活 - 极速社区:2021-11-12 · 去年的搬瓦工,昨天刚续费,结果还是上不了谷歌 最近是不是墙都翻不了啊 ,极速社区 目前安全的只有v2ray和brook(估计拼错了) 就算你用v2ray也要用网站伪装下 否则一个月稳稳的被封 搬瓦工这种速度慢还要钱 而且用的太多坏处也太明显

To analyse this, I added all human Gencode transcripts to the Salmon reference from the previous post, along with a human 18S rRNA sequence. This will account for the first two possibilities. For the case of human DNA contamination, I extracted the unmapped reads from Salmon and aligned them to the human genome with HISAT2. With the remaining unmapped/unaligned reads I calculated the final mapping rate.


It is clear that we hace solved the mystery of the extremely low-mapping population! By breaking up the mapped reads into the sources of contribution like we have done before for each sample, we can see which have the contamination cases have happened.


In the entirety of the plate 20003_6, as well as stretches of the plate 20003_8 we see that by for the greatest contribution of material is from human intergenic DNA, suggesting that the contamination happened during library preperation.

At this point I want to illustrate how much of the reads we have no found an explenation for compared to the original reference.


电脑用的梯子:免费pc梯子软件飞机怎么登陆twitter usb网络加速器穿梭伕理官网梅林ssr插件离线安装包老王vpn为什么用不了老王vpm - 百度手机vp收费跨墙软件 seed加速怎么能使用国外...

To summerise, the Salmon reference now contains Mouse Gencode genes, ERCC spike-in sequences, mouse ribosomal RNA, a TSO concatemer "bait" sequence, Pseudomonas 16S and 23S sequences, human Gencode genes, and human 18S. Additionally, the unmapped reads are aligned to the human genome.

【图】安卓手机上的梯子,怎么分享给pc用?_1_四川论坛_爱卡汽车:安卓手机上的梯子,怎么分享给pc用?(1/1) 楼主:0z 返回帖子 评论图片 分享 手机上有好使的梯子,没有pc版本 有什么app能把手机上的梯子分享出来给pc用? ...

/Valentine Svensson

Low mapping rate 4 - Bacterial contamination in reagents

The large amount of amplification needed to get usable material from single cells cause us to get detectable signal from minor effects such as ribosomal RNA or TSO concatemers. This also means reagents need to be particularly pure. Several studies have 电脑用什么梯子 the bacterial contents in kits used for sequencing studies, investigating the 'kitome'.

Reagents might have a small amount of contamination which will not be detectable in the "bulk" assays with more input material it was designed for, or even for relatively large single cells.

In particular, some commercial reverse transcriptases have been reported to contain contaminating bacteria. pc可众用的梯子 is generally recommended for anyone using scRNA-seq. A few years ago multiple people reported electropherograms which showed different sample purity depending on the reagent lot. This is an illustrateive plot reported by user bplevi:


The thread contains more examples good vs bad lots of reagents. They also reported which lots they had had success or failure with. For the data analyzed in the previous posts of the series, SuperScript II has also been used, though I don't know the lot number.

The company producing SuperScript II later acknowledged a couple of lots where contamination of E. coli had beed detected.

Analysing the reads from my data, I didn't see contribution from E. coli, however, I did find contamination from Pseudomonas. This bacterial family have also been reported in papers describing reagent contamination.

I added sequences for 手机电脑都能用的梯子 16S and 23S to the Salmon reference, reran the samples, and compared the results to the previous iteration.


With this addition the mapping rate for the majority of the cells jump up to ~75%!

求个电脑和苹果手机用的梯纸..就v、pp那个嗯.._百度知道:最佳答案: 之前用的要么跑路要么坑爹全卸载了,现在固定用坚果,好不好用lz用过应该知道更多关于手机电脑都能用的梯子的问题>>


It is quite interesting that some plates have a larger contribution of TSO concatemers while others have larger contributions of bacterial sequences. All these plates are suppowed to have the same lots of reagents. However, the plates are sorted on different days then stored in a freezer until the experiment is done. It might be the case that TSO products and bacterial products degrade differently due to this.

(I should also acknowledge Luisa for pointing me to various bacterial resources.)

/Valentine Svensson

Low mapping rate 3 - TSO concatemers

Compared to the two 求一个电脑上能用的梯子 posts in the series, this post deals with something more technology specific.

为啥我用shadowsocks搭的梯子只能用wifi登外网,用4g就登不了呢有...:1楼: 为啥我用shadowsocks搭的梯子只能用wifi登外网,用4g就...8楼: 线路不行还用说,这边稳定多线路,秒开不限速,全平台...

Once cDNA have been reverse transcribed from mRNA from a cell, it can be pre-amplified if it has PCR adaptors at both ends of the cDNA. A particularly convenient way to add these adapterors is though pc端如何上外网.

Template switching oligos

Here when reverse transcription reaches the 5' end of the RNA, a CCC sequence is added. This allows a DNA oligo with GGG at the to bind to the end if the cDNA. This oligo allows the second strand of the cDNA to be generated, and at the same time provides as an adapter for PCR primers.

In the standard implementation of the Smart-seq2 protocol the template switching oligo (TSO) is AAGCAGTGGTATCAACGCAGAGTACATGGG.

Sometimes these TSOs concatenate to longer DNA sequences, and get amplified along with the cDNA. If you investigate reads not mapping to the transcriptome or rRNA you will find a number of reads whic have the TSO repeated after itself multiple times.

The TSO concatemers can be accounted for during quantification by including a FASTA record of a TSO concatemer in the reference, like this one:


Rerunning Salmon with the new reference, we can compare the mapping rates to those in the previous post:


As we can see, the majority of cells get increased mapping rate when including the TSO concatemer. And many cells go from single digit percentages to over 50%! These samples are likely wells with almost no cellular mRNA in.

As before we can visualize the relative contribution of fragments from the different sources (here I merged the rRNA genes expression to one unit).


谁有梯子,手机端的,求发_好运百科:2021-7-12 · 话题:求梯子啊!问:现在都没有能用的V p n,有没有学生党承受的起的埃 推荐回答:要手机的还是要电脑的 话题:请问<梯子游戏>手机客户端怎么下载?推荐回答: 사 话题:有没有在手机上可众玩的(梯子游戏)??推荐回答: 관 话题:求一个手机

To quickly investigate different concatemers in data I created a little 手机能用的梯子免费 to our readquant collection which counts the number of occurances in reads from FASTQ files.

$ fastq/20003_3#57_1.fastq fastq/20003_3#57_2.fastq AAGCAGTGGTATCAACGCAGAGTACATGGG

A typical strategy when investigating low mapping rates is to BLAST unmapped reads. Many times this will give results from scaffolds of the common carp genome (pc可众用的梯子). Actually, if you simply BLAST a 3x concatemer of TSO, it will map all over the carp genome with 100% similarity

Screen Shot 2017-09-07 at 00.22.08.png

Finally, I should mention that Smart-seq2 isn't the only protocol making use of template switching. It is also used in STRT-seq, the different flavours of Drop-seq (e.g. SeqWell, DroNc-seq etc) as well as in the very popular 10X Genomics Chromium single cell solution.

/Valentine Svensson