Welcome to WhatIsMyGene

Over a period of more than 7 years, we’ve accumulated data from over 30,000 biological studies: transcripts, proteins, micro-RNAs, and more. Rather than developing new algorithms for aligning experimental data with well-circulated gene lists (e.g. GO lists), our focus has been on the brute accumulation of data. We note that recent leaps in the utility of tools such as ChatGPT have been largely fueled by the availability of massive training sets. We’ve thus constructed the largest transcriptomic database of its kind. We also house massive quantities of processed proteomic, micro-RNA, ChIP-seq, RIP-seq, and similar data, all of which can be subdivided according to cell type, disease, knockout, drug treatment, and much more.

Here we provide a number of tools to extract insights from this database. Whether you’re investigating the functions of a single gene or a large set of genes, you may find these tools indispensable to experimental interpretation and hypothesis generation.

Several of our tools access large databases and perform intensive computations. In most cases, though, you should see output in a minute or less.



Relevant Studies

This is the simplest and most intuitive of our tools. Here, researchers can ask the basic question, “In what studies is a biological entity (transcript, protein, micro-RNA, etc) upregulated, downregulated, modified, or otherwise found in a list of interest?” While a standard Google Scholar search may tell you when your gene has a starring role in a paper, it will not tell you when the gene is buried in a supplemental dataset. Enter your favorite biological identifier and see what happens!

Given the size of our database, the sheer volume of studies relevant to your particular entity (hereafter referred to as a “gene”) may be overwhelming. Use the various filters to restrict this output. You can restrict the search to cases where the gene is only upregulated (or downregulated), to a particular species, to a particular type of molecule (protein, transcript, micro-RNA, etc), or to a particular experimental setup (knockdown, overexpression, drug response, disease, etc). You may also enter a keyword of your choice (“cancer”, “metastasis”, “virus”, “liver”, “covid-19”, etc.) which must be found in all studies, or must be excluded from the studies. Another filter is the "Restrict IDs" option. Since most of our data is sorted according to some criteria (significance, fold-change, etc.), this feature allows you to eliminate low-ranked genes from analysis. For a complete description of this feature, check out the relevant page on our blog. A final filter you'll see is "Emphasize Internal Significance." Here, you can discard studies in which genes were not significantly altered. Again, for a deeper discussion of this feature, check out our blog.

If you enter two identifiers, our algorithm requires that both identifiers be found together in a list within our database. If three or more identifiers are entered, only the first two will be examined; this is because entering two, in most cases, already strongly diminishes the chances of getting any output at all. If, on the other hand, no gene symbol is entered, you will get a list of up to 100 studies that match the filters you've set.

Scanning the descriptions of the various studies is a powerful way to home in on the functions of a gene. The tool is also useful for finding studies (via "dbase_id") that you may wish to further analyze with other tools.

Fisher Analysis

Enter a list of gene symbols and search for studies that best match this list. Or, by using the “anti-correlation” option, search for studies that fail to intersect your own (see below for a more detailed explanation). The output is a list of studies ranked by log(P-value). The study with the most negative log(P-value) best overlaps your own. These P-values are generated via Fisher’s exact test. One critical point: the output P-values are approximations of rigorously-derived Fisher values and should not unquestioningly be submitted for publication. See “Exact Fisher Analysis” below for an explanation of our use of this statistical method.

If you're interested in further examining a dataset found in our own database, you may also simply enter the ID ("dbase_id") we've given to a study. Our internal background figure for the study will override any user-entered background.

The background of a study is the count of all positive identifications. A typical, modern microarray allows identification of about 22,000 RNA species. A typical, modern mass-spec experiment allows identification of 5,000 proteins. Enter the background of your study in the “background” box; minor errors in this figure will not result in large changes in P-values. Even large errors, in fact, will not strongly alter the rankings (best intersection, second best, etc.), though the P-values will indeed change.
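
For those curious about the mechanics, the sketch below shows how such an overlap P-value can be computed in Python with scipy. All counts are illustrative, and the snippet is a simplified stand-in for, not a copy of, our internal code.

    # Illustrative only: overlap P-value for a user list vs. one stored study.
    from scipy.stats import fisher_exact

    background = 22000   # genes detectable in the study (the "background" box)
    user_list  = 250     # genes in your list
    study_list = 200     # genes in the stored study's list
    overlap    = 40      # genes shared by the two lists

    # 2x2 contingency table over the common background
    table = [[overlap, user_list - overlap],
             [study_list - overlap,
              background - user_list - study_list + overlap]]
    odds_ratio, p = fisher_exact(table, alternative="greater")
    print(p)   # smaller P = stronger-than-chance overlap

Doubling or halving the background in this sketch shifts the P-value, but it rarely changes which studies rank at the top.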

As with our “relevant studies” analysis, various filters may be applied. Use the filters to restrict this output. You can restrict the search to cases where the entity is only upregulated (or downregulated), to a particular species, to a particular type of molecule (protein, transcript, micro-RNA, etc), or to a particular experimental setup (knockdown, overexpression, drug response, disease, etc). You may also enter a keyword of your choice (“tumor”, “survivors”, “resistant”, “aorta”, “SARS”, etc.) which must be found in all studies, or must be excluded from the studies.

Two studies may generate significant P-values in two different ways. Most commonly, a researcher is interested in significance that arises via a strong intersection between two studies. However, significance may also arise via a lack of intersection. In the latter case, we reverse the sign of the log(P-value). Thus, a study that strongly lacks intersection with your own will have a positive log value. These results are seen when the “anti-correlation” option is chosen. Here, we require the user to make a selection from the "Molecule" filter. It's not surprising, for example, that a protein dataset and a micro-RNA dataset have zero overlap; this lack of overlap could be very significant in a statistical sense, but it's also entirely un-interesting. Thus the requirement that you choose a Molecule type. Do not assume that there's no overlap at all between sets when a positive log value is seen. In one test we ran, the expected overlap between sets was 42 IDs, but the actual overlap was 36, resulting in the positive value. Note also that it's not surprising at all to have zero overlap with a joint background of, say, 20,000, and two datasets of around 150 genes each; strongly positive values will only be seen when one set has a small background, at least one set is large (over 400 genes), or a combination of both.
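
To make the sign convention concrete, here is a small sketch; the numbers roughly mirror the example above and are purely illustrative of the idea, not our exact code.

    # Illustrative sign convention for anti-correlation output.
    import math
    from scipy.stats import fisher_exact

    background, set_a, set_b, overlap = 20000, 600, 1400, 36
    expected = set_a * set_b / background      # = 42 expected by chance

    table = [[overlap, set_a - overlap],
             [set_b - overlap, background - set_a - set_b + overlap]]

    if overlap >= expected:
        p = fisher_exact(table, alternative="greater")[1]
        signed_log_p = math.log10(p)           # negative: enrichment
    else:
        p = fisher_exact(table, alternative="less")[1]
        signed_log_p = -math.log10(p)          # positive: depletion (anti-correlation)

    print(round(expected), overlap, round(signed_log_p, 2))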

Creative use of filters can be explored. If you have a protein dataset, you may wish to ask whether transcriptomic datasets overlap strongly; in the “molecule” box, select “transcripts” and competing protein-based datasets will not be examined. You may wish to know if your human transcriptomic dataset is reflected in studies of more primitive organisms; select “c elegans” or “yeast” as “species” (don’t expect amazing P-values in this case; nevertheless, insights may be gained). If you have a micro-RNA dataset, you can select “transcripts” to eliminate micro-RNA-specific studies (i.e. micro-RNA arrays) and thus focus on RNA-seq datasets where micro-RNAs were mixed together with standard mRNAs and lncRNAs.

Coexpression

What genes tend to be upregulated when your own gene of interest is upregulated? What genes are downregulated when your gene is upregulated? These kinds of questions may be explored with our co-expression analysis feature.

In brief, following application of any filters you desire, our program scans all studies in our database for appearance of your gene. It also records the frequencies of genes that accompany your gene. We calculate the expected appearances of these genes as well; genes that are commonly involved in regulatory processes (e.g. cyclin-D, STAT1, etc.) tend to be widespread in our database, while other genes are relatively rare. The expected appearances of these genes are then compared against the actual appearances via the binomial distribution. Genes that are significantly over-represented in association with your own gene are output as a table. In most cases, your own gene will be listed first in the table, with a very significant or “0” P-value alongside (that’s because the program is treating your gene of interest as a potential coexpressed gene). This “self P-value” could theoretically be less significant than the P-value that results from the association of your gene with a different gene…this unusual case would be caused by the frequent association of your gene with a gene that is otherwise rare over the entire database.
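
As a rough sketch of the statistic, the snippet below asks whether a candidate partner appears more often than expected among the lists that contain your gene. The counts are invented for illustration; this is not our production code.

    # Illustrative: is a candidate partner over-represented in lists
    # that contain your gene of interest?
    from scipy.stats import binomtest

    lists_total        = 30000   # lists in the database after filtering
    lists_with_gene    = 300     # lists containing your gene
    lists_with_partner = 900     # lists containing the candidate partner, anywhere
    observed_together  = 45      # lists containing both genes

    p_chance = lists_with_partner / lists_total   # chance any single list holds the partner
    result = binomtest(observed_together, n=lists_with_gene,
                       p=p_chance, alternative="greater")
    print(result.pvalue)   # small P = the partner "travels with" your gene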

You may actually enter two genes in the search box. Here, our algorithm will simply scan for all studies in which the two genes are both found. For a fuller discussion of this option, check out our blog.

Most likely, you are interested in genes that are co-regulated (up or down) in relation to your gene…you may not be interested in the possibility that your gene and another gene are related by the fact that they both code for proteins with a high density of proline (a list that is indeed found in our database). Eliminate these irrelevant lists by using the “regulation” feature or by using “regulated” as a keyword. If, for some reason, you wish to search all our data (e.g. include the list of proteins with high proline density), select “Any” as the “regulation” filter.

Anti-coexpression occurs when your gene is upregulated and another gene is downregulated (or vice-versa). The question of “anti-coexpression” may be addressed in two ways. One method is to note which genes are absent when your gene is present in a dataset. Another is the following: when your gene is upregulated (data column 1), note which genes are downregulated (data column 2), or vice versa. We employ this second approach when “anti-co-regulation” is selected. To be clear, if the user selects both the “anti-coexpression” and “upregulation” options, a list of genes that are downregulated when your gene is upregulated is created. To be superclear: if your gene is CDK4, selection of “anti-coexpression” and “downregulation” will generate a list of genes that are upregulated when CDK4 is downregulated. When using the “anti-coexpression” option, the “regulation” filter options “Any” and “up or downregulated” give the same result, as “anti-coexpression” only makes sense when a list of upregulated genes is complemented by a list of downregulated genes.

If your gene of interest is quite rare, it may not appear in our database frequently enough for coexpression data to be generated.

Micro-RNA inputs represent a special case. Our database includes numerous (about 450) studies that solely detected micro-RNAs (via micro-RNA arrays), so if you input a particular micro-RNA without applying additional filters, the output list will be dominated by other micro-RNAs. We thus suggest use of the “molecule” filter, set to either “Micro-RNA” or “transcript.” In the former case, only co-expressed micro-RNAs will be output; in the latter, all studies containing transcripts (which may or may not include micro-RNAs) will be examined.

Regardless of our algorithmic and statistical approach, coexpression analysis depends on a fairly random selection of studies. If, for example, we were inordinately fond of studies involving T-cells, to the detriment of other types of studies, P-values might be skewed. Gene X may be profoundly associated with coregulated partners in epithelial cells, but the focus on T-cells would hamper this understanding. Our approach to adding new data, however, is basically “grab whatever is available”, minimizing this concern. Also, we largely (though not entirely) avoid generating multiple lists from a single study (e.g. upregulated transcripts at 6, 12, 24, and 48 hours of IFN treatment…four upregulation lists that would likely overlap with each other to a large extent). Even assuming a random approach to database construction, however, the field of molecular biology itself may be subject to fads. Use filters to remove some of the bias that might arise from a focus on particular cell types, study types, molecule types, and more. In some cases, for example, we store both transcriptomic and translatomic data from a single study, with strong overlaps between the two sets; by selecting the “transcripts” filter, the translatomic data will not be examined.

Because of the potential skewing of coexpression analysis by repetitive studies, we do not enter data from “targeted” transcriptomic (e.g. “Nanostring”) or proteomic (small-scale protein arrays) studies into our database.

Use the coexpression feature creatively. There are numerous genes whose functions are unknown; one can infer function by examining the genes that are coexpressed with the unknown gene. To take this exercise a step further, you can enter the list of coexpressed genes into the Fisher analysis program.

Matching Studies

Bioinformatics datasets typically contain an upregulated subset and a downregulated subset. Here, we ask the user to enter both of these subsets (as opposed to one, as in our Fisher Analysis tool). Again, various filters may be employed to limit output. The output will be studies that match your own study by virtue of overlap with both up- and downregulated sets, i.e. your upregulated set matches another study’s upregulated set, and your downregulated set matches the same study’s downregulated set. By selecting “inverse correlations”, your upregulated set will be matched against a study’s downregulated set in our database, and your downregulated set will be matched against the same study’s upregulated set. As with the Fisher tool, if you're interested in further examining a dataset found in our own database, enter the IDs ("dbase_ids") we've given to both the upregulated and downregulated portions of a study. Entering disparate studies will not throw an error, but it doesn't make much sense. Our internal backgrounds for these studies will override any user-entered background.

The scoring system, seen in the output, is simple: we multiply the two Fisher-derived log(P-values) together. If the up/up log(P-value) is -10 and the down/down log(P-value) is -5, the score will be 50.
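
In code form (values illustrative):

    log_p_up_up     = -10.0   # log10 P: your up set vs. the study's up set
    log_p_down_down = -5.0    # log10 P: your down set vs. the study's down set
    score = log_p_up_up * log_p_down_down
    print(score)              # 50.0; larger scores indicate better two-sided matches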

If you click the “Match Studies” feature, you'll see that we've pre-loaded two datasets from our own database. These two lists contain transcripts that are canonically upregulated in a cytokine storm, and transcripts that are canonically downregulated. Following “submit”, the output may not surprise you. However, if you're interested in the question of reversing or moderating a cytokine storm via a drug, click on “Inverse Correlations” and choose “drug” as the study type. Have fun.

If your own study contains up- and downregulated portions and you’re interested in finding studies that best match your own, try this tool. Surprisingly often, two experiments strongly overlap in their upregulated portions but only weakly overlap in their downregulated portions, meaning that one could draw faulty or incomplete conclusions about a gene’s function by examining only one (up- or downregulated) portion of, say, the transcripts that are altered upon your gene’s knockout.

Regulation

Here, we compare the list of genes that are coexpressed with your own gene with the list of genes that are strongly altered when your gene is knocked-out or overexpressed. If the lists overlap strongly, a strong hypothesis is that your gene of interest is responsible for regulation of its coexpressed partners (as opposed to a partner controlling expression of your gene, or an unknown entity controlling expression of your gene).

Our algorithm works as follows. First, the database is examined for studies in which your gene of interest was targeted (via knockdown, a drug, etc.). This list is short; currently, the gene Dicer is the most targeted entity in our database, with about 45 studies (i.e. 90 lists of up/down-regulated genes upon Dicer targeting). If such studies exist, we proceed to perform our standard co-regulation analysis for your gene. To be clear, filters apply to coexpression analysis, not to the list of cases in which your gene was targeted. Fisher’s exact test is then used to compare the list of coexpressed genes with all studies in which your gene was targeted. All relevant studies and their P-values are output.

One technical point: studies involving the targeted gene are excluded from coexpression analysis. This way, the “target lists” and the “coexpression list” are independent. Practically speaking, this act of separating lists makes little difference in P-values. In fact, we’ve seen cases in which significance increases slightly when the two lists are separated. Nevertheless, this step makes statistical sense.
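
Schematically, the pipeline looks something like the toy sketch below. The in-memory "database", gene lists, and helper function are purely illustrative, not WIMG internals.

    # Toy sketch of the Regulation pipeline; data and helpers are illustrative.
    from scipy.stats import fisher_exact

    BACKGROUND = 20000

    def overlap_p(list_a, list_b, background=BACKGROUND):
        a, b = set(list_a), set(list_b)
        k = len(a & b)
        table = [[k, len(a) - k],
                 [len(b) - k, background - len(a) - len(b) + k]]
        return fisher_exact(table, alternative="greater")[1]

    # studies in which the gene of interest was targeted (kd/ko/oe/drug)
    target_studies = {
        "CDK4 inhibitor in myeloma, downregulated": ["CCND1", "E2F1", "MCM2", "RB1"],
        "CDK4 inhibitor in myeloma, upregulated":   ["CDKN1A", "ALB", "TTR"],
    }

    # co-expression list for the gene, computed with the targeted studies excluded
    coexpressed = ["CCND1", "E2F1", "MCM2", "CDK6"]

    for desc, genes in sorted(target_studies.items(),
                              key=lambda kv: overlap_p(coexpressed, kv[1])):
        print(f"{desc}: P = {overlap_p(coexpressed, genes):.2g}")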

Most commonly, the experiments in which a gene is targeted do not correlate strongly with the gene’s coexpressed partners. This implies that the gene does not regulate its coexpressed partners; some other entity may be responsible for that. In other cases, the correlation between a targeted gene and its partners may be strong. For example, try entering CDK4 in the search box. Currently, we derive a P-value of 10^-10 between the list of CDK4’s coregulated partners and genes downregulated in a study in which myeloma cells were treated with a CDK4 inhibitor (see Prolonged early G1 arrest by selective CDK4/CDK6 inhibition sensitizes myeloma cells to cytotoxic killing through cell cycle–coupled loss of IRF4). It is thus clear that CDK4 plays an “overlord” role with respect to many of its coexpressed partners. The relationship may be further probed by limiting the coexpressed partners to genes that are upregulated alongside CDK4 (P=10^-7), as opposed to downregulated (P=10^-39). Inhibition of CDK4 strongly downregulates the genes which tend to be upregulated alongside CDK4; genes which tend to be downregulated alongside CDK4 are less profoundly altered. Most simply: genes that like to hang out with CDK4 are upregulated by CDK4.

To further complicate analysis, apply the “Anti-Co-Expression” option. Here, upon input of CDK4, we derive a P-value of 10^-5 against genes that were upregulated upon CDK4 inhibition; genes that are upregulated when CDK4 is downregulated, or vice versa, tend to be upregulated on CDK4 inhibition. Applying the upregulation and downregulation filters gives P-values of 10^-1.6 and 10^-3, respectively. Some thought may be required when analyzing the output, but we believe powerful insights may result. Even in the case of weak P-values, a hypothesis may be generated; the gene of interest does not occupy a “controlling position” over its coexpressed or anti-coexpressed partners.

As always, be creative in using this tool. Do micro-RNA target predictions really correlate with genes that tend to be downregulated when the micro-RNA of interest is upregulated? What about experiments in which transcription factors are targeted? If you focus entirely on a particular cell type (e.g. liver, via the keyword filter), will the above CDK4 results change? If the CDK4 coregulation list corresponds strongly with CDK4 knockdown study A, but not with CDK4 knockdown study B, what is the critical difference between the two experimental setups? If your gene of interest does not occupy a controlling role over its coexpressed partners and you wish to identify a particular gene that indeed controls these partners, you can reverse the process utilized by our algorithm: take the coexpressed list (from the coexpression tool) and enter it into Fisher analysis.

Third Set

We’ll explain this tool with an example. If you have a lab-generated list of genes (X) that are upregulated on viral infection and perform standard bioenrichment on it (e.g. insert the list into our Fisher analysis tool), you probably won’t be surprised if “genes upregulated upon 24 hours of dengue infection” (output A) most significantly parallels your own list, and “genes upregulated on RIG-I overexpression” (output B) gives the second most significant overlap. It also isn’t surprising that lists A and B overlap very significantly. This A|B overlap can be problematic if you wish to generate insight into the nature of your lab results; set B is very similar to A, and thus adds very little value to A alone. In the “third set” tool, we search for situations where A and B do not overlap each other.

To use the tool, you’ll first need to enter two gene sets. If the two sets do not overlap significantly, our tool will not throw an error, but the results will not be insightful. The tool then searches for a third set (B) that significantly overlaps the first, central set (X) but does not overlap significantly with the second set (A).  Thus both A and B can offer insight into the nature of X, as there’s little overlap between A and B. In a sense, the tool finds two clusters for X.

As with the Fisher tool, if you're interested in further examining datasets found in our own database, enter the ID(s) ("dbase_id") we've given to the study/studies. If two studies are entered from our own database, internal background figures will override the user-defined background figure. If one study from our own database is entered, the smaller of the user-defined entry and the internal figure will be used as a background.

For simplicity, we require the A|B intersection to have a P-value above the standard 0.05 cutoff; that is, A and B must not overlap significantly. The output “p_score” is simply the log(P-value) associated with the X|B intersection minus the log(P-value) associated with the A|B intersection. Studies with the most negative p_scores are listed first.
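
In code form (log values illustrative):

    import math

    log_p_XB = -12.0   # log10 P: candidate set B vs. your central set X
    log_p_AB = -0.4    # log10 P: candidate set B vs. your second set A

    # B is only reported if its overlap with A is not significant (P > 0.05)
    if log_p_AB > math.log10(0.05):
        p_score = log_p_XB - log_p_AB
        print(p_score)   # -11.6; more negative p_scores rank higher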

“Exact” Fisher Analysis

This is a tool that does not call upon any underlying database.

The P-values generated when we scan your input against all studies in our database are approximate. This is because we don’t store the complete backgrounds (lists of all genes found within a study) of studies. Instead, we store the numerical backgrounds (e.g. 22,184), which allows an estimate of the actual background common to two studies.

For “journal-ready” P-values, use our “Exact” Fisher tool. You’ll need the backgrounds from two studies, and two subsets of interest from those backgrounds. All the IDs need to be of the same type (ensembl, refseq, uniprot, wingdings…it doesn’t matter). If they’re not, you’ll need to convert one type to the other (Biomart, the “Synergizer”, bioDBnet, Uniprot, and more offer tools).
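
A minimal sketch of what the tool computes from your four lists appears below. The tiny gene sets are placeholders; real backgrounds would contain thousands of IDs.

    # Illustrative exact Fisher calculation from two full backgrounds.
    from scipy.stats import fisher_exact

    background_1 = {"STAT1", "STAT6", "IRF1", "MX1", "ALB", "TTR", "GAPDH"}
    background_2 = {"STAT1", "STAT6", "IRF1", "MX1", "ALB", "ACTB", "TP53", "GAPDH"}
    subset_1 = {"STAT1", "IRF1", "MX1"}    # genes of interest from study 1
    subset_2 = {"STAT1", "IRF1", "TP53"}   # genes of interest from study 2

    joint = background_1 & background_2    # the true shared background
    s1, s2 = subset_1 & joint, subset_2 & joint
    k = len(s1 & s2)

    table = [[k, len(s1) - k],
             [len(s2) - k, len(joint) - len(s1) - len(s2) + k]]
    print(fisher_exact(table, alternative="greater")[1])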

In the end, you may be surprised how well our estimates align with “real” Fisher’s exact test P-values.

Intersect

This tool allows you to find the genes that are common to two sets.

If you enter two gene sets of the same type (and click the "same type" box), you'll receive the intersecting genes. This could easily be performed in an Excel spreadsheet using, for example, the "vlookup" function. In this case, you'll receive output fairly quickly.

However, this tool can be used for two situations that would be problematic for Excel. First, you can find the intersection of two gene sets that have different ID types (e.g. ENSG0000012345 and STAT6). The algorithm is simple, but requires a fair amount of processing: convert both gene lists to the format used in our database, find all common IDs, and then re-convert the IDs to uniprot (e.g. STAT6) format. If a uniprot ID is not available for a particular gene, we may output a different format. Second, you can use our own database IDs as input. This way you can find the genes common to your own set and the sets found in our database. Given the size of our database of IDs, the above two methods are fairly processor-intensive. We thus limit the size of a list of genes to 400 IDs, unless the "same type" box is clicked.
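
Conceptually, the cross-type intersection works like the sketch below; the mapping dictionary is a tiny illustrative stand-in for our internal ID tables.

    # Illustrative: intersect an Ensembl-ID list with a gene-symbol list.
    ensembl_to_symbol = {
        "ENSG00000141510": "TP53",
        "ENSG00000171862": "PTEN",
        "ENSG00000121594": "CD80",
    }

    list_a = ["ENSG00000141510", "ENSG00000171862"]   # Ensembl gene IDs
    list_b = ["TP53", "STAT6", "CD80"]                 # gene symbols

    converted_a = {ensembl_to_symbol.get(g, g) for g in list_a}
    print(converted_a & set(list_b))                   # {'TP53'}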

Cell Types

This simple but powerful tool allows you to enter a gene name and receive a list of keywords most positively or negatively correlated with the gene. For a complete description of the tool, check out the relevant page on our blog.

General Questions

Q: Why are you funding the site with donations?

A: We intend to make a living this way. You may have noticed that numerous bioinformatic analysis sites fail to function shortly after their accompanying papers are released. Or, these sites fail to update. The financial incentive means that we will continue to update and expand our database, improve existing tools, and add new ones over time.

Q: I want to cite your work. How?

A: Don’t feel obliged. If our tools generate insight, you will undoubtedly be pointed to studies that are relevant to your work. You can cite those studies. If a citation is a necessity, we have previously used the database and accompanying tools in an “in-house” context, without offering algorithms (or the database!) in our published work. The following PMID would be relevant: 30097535. We do intend to write a paper that relates much more directly to the tools and underlying database…we’ll let you know when that occurs.

Q: Can I get your database?

A: Sorry. The database represents thousands of hours of data collection. We wish to monetize it.

Q: What are the cutoffs that are used in your database? What is the requirement for a gene to be considered upregulated (versus “not regulated”)?

A: Addition of a new column of study-specific data into the database typically goes like this: a dataset is identified and then sorted according to fold-change, fold-change with a minimum significance requirement, or fold-change divided by significance (a procedure that is often used to select genes of interest in a volcano plot). The top 200 upregulated IDs are then selected, translated into the format used in the database, and entered. The numerical backgrounds (the number of positive identifications in a study, regardless of regulation status) of the studies are also noted, as these integers are utilized in Fisher analysis. The same logic applies to the lowest-ranked (downregulated) IDs. Thus, one study typically results in two columns of data.
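
As a sketch of that selection step (the file name and column names below are assumptions for illustration, not a fixed format we require):

    # Illustrative ranking of a supplemental table before database entry.
    import pandas as pd

    df = pd.read_csv("supplementary_table.csv")   # assumed columns: gene, log2fc, pval

    # sort by fold-change alone
    top_by_fc = df.sort_values("log2fc", ascending=False).head(200)

    # fold-change with a minimum significance requirement
    top_sig = (df[df["pval"] < 0.05]
                 .sort_values("log2fc", ascending=False)
                 .head(200))

    # fold-change divided by significance (volcano-plot style)
    top_volcano = (df.assign(fc_over_p=df["log2fc"] / df["pval"])
                     .sort_values("fc_over_p", ascending=False)
                     .head(200))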

However, we have no qualms about breaking the above general pattern. Even some high-impact studies offer simple supplemental lists of genes without any fold-change, significance, or background data. We do not wish to exclude these datasets (in the case of the background figure, we’ll make a best guess…for modern transcript-based studies, our default is 20,000). In other cases, significance and/or fold-change are irrelevant…for example, a list of human transcripts with at least 50 exons.

Q: Why do you keep data regarding peptides and PTMs? It doesn’t make sense to analyze this data at the protein level!

A: Ideally, one would indeed analyze PTM data at the level of peptides. A single protein may be phosphorylated at multiple sites, and de-phosphorylated at multiple other sites upon drug treatment. In our experience, however, datasets such as “proteins with multiple sites that are phosphorylated on EGF treatment” can actually intersect significantly with other studies that aren’t necessarily related to the question of phosphorylation.

We also store antigen and neoantigen data. Is it possible that particular proteins are especially likely to undergo proteasomal degradation in certain contexts, with the resulting peptides being displayed by HLA? Such a trend would be noted at the protein level. Even better, it’s possible that these proteins could be coaxed to higher expression levels via some treatment.

To keep things simple…we see no harm in storing peptide and PTM data in this fashion. We do the same thing for transcripts, actually; lists such as “transcripts with extensive m6a alterations on mettl14 ko” are found in the database.

Q: What separates your database and tools from those available at other sites?

A: There’s no absolute separation. The combination of our unique tools and algorithms, emphasis on individual studies (as opposed to, say, GO enrichment groups), unconventional gene lists (e.g. the most common transcripts that aren’t targeted by micro-RNAs), sheer volume of data, and our frequent updates based on the most recent datasets, means that you might well obtain insights here that wouldn’t be obtained elsewhere. We certainly don’t disparage other tools (GSEA, DAVID, etc.); we’ve used them ourselves!

One fairly unique WIMG feature is our use of "background-adjustment" on datasets with ill-defined backgrounds; see the relevant page on our blog.

Q: Can I study splice variants with your tools?

A: Our tools are not ideal for this purpose, as we combine all splice variants in a study into a single identifier. There’s certainly no harm in entering your variant list into the Fisher analysis box and seeing what happens. One unusual case that may signal interesting splice patterns occurs when a large number of genes are found in both the upregulated and downregulated lists from a particular study. Here, the same genes may be found in both columns because some of a gene’s splice variants are upregulated, while others are downregulated.

We may yet offer splice variant analysis, but this would be a long-term project, requiring an overhaul of our database.

Q: I can’t interpret your descriptions of various studies.

A: We use numerous abbreviations and conventions for the sake of brevity. A description of these terms is found here. Also, bear in mind that the underlying database was initially developed for in-house use and may occasionally contain terms that are not relevant to site-users. We attempt to remove or clarify these descriptions as we find them. Please let us know if any descriptions are particularly difficult to interpret.

Q: I enter a gene symbol, but get no output.

A: Your gene symbol may not be found in our database. We store Ensembl gene IDs for humans (e.g. ENSG0000012345) and mice, refseq IDs (NM_12345), uniprot (STAT1), and more. Given the huge variety of identification systems, however, we may not store your symbol type. Can you convert it into a more common symbol type, or into a human/mouse ortholog?

Q: You are missing a study!

A: Let us know!

Q: You made an error!

A: Let us know.

Q: What’s upcoming?

A: We have many ideas. In the short term:

  • Write a paper! Our hope is that the paper will be far more than an introduction to yet another gene enrichment service...we want to make a serious dent in the field.
  • Continue to fill out our blog!

Q: Who are you?

A: Briefly, I’m Kenneth Hodge. I received a BA in biochemistry from UC Santa Cruz in 1989 and then, quite naturally, became a laborer, a building contractor, and a wine chemist. I also oversaw a finance website, Stockwarrants.com (now folded), for nearly 20 years. I resumed my academic focus in 2009 at Mahidol University in Thailand, culminating in a Ph.D. in Genetic Engineering and Microbiology, followed by three years post-doc work at Chulalongkorn University. My academic focus has been on virology (especially dengue), cancer and neoantigen identification, and of course, bioinformatics. You can find me on a handful of papers. My obsessive, hoarding nature is on full display on this very website.

Along the way, of course, numerous professors and colleagues have assisted and inspired me. Their imprint can be seen on this site.

When not in hoarding mode, you might find me at 7,000 meters in the Himalayas (a 3 hour flight from Bangkok).

 

*******

Terminology

We’ll explain our terminology with examples of output you might see:

upregulated in mcf7 on mel-18 kd (GSE64716: MEL-18 loss mediates estrogen receptor)

Our default biological entity is an RNA transcript. The absence of additional information such as “MS” (for mass spectrometry results, i.e. proteins), “mirna” (micro-RNAs), and “lncrna” (long-noncoding RNAs) tells you that this experiment involved transcripts. Our default species is human, and thus we do not specify that the mcf7 cell line is human. “kd” means “knockdown”; thus the mel-18 transcript was knocked down in this study. Genes can be knocked-down, knocked-out (“ko”), or over-expressed (“oe”). GSE64716 is an identifier for a “Geo Dataset” study; if you want detailed information for the experiment, just google “GSE64716”.  We also include a portion of the title of the underlying paper; the rationale here is that the title may contain search terms that would be useful when you use one of our tools. Note that we do not specify how we treated the underlying data; our default is to simply sort according to fold-change. 

MS upregulated in caco2 on 24h covid19 infection (vs. 2h) (s2: Proteomics of SARS-CoV-2-infected host cells)

Here, the entity is proteins; the use of “MS” makes that clear. A portion of the title is again given. “s2” refers to the supplemental dataset from which the data was extracted. “24h” means 24 hours; “3d” would mean “three days.” 

downregulated in mouse pancreatic cancer line on slug1 kd (both w/glutamine deprivation)(raw data w/ttest (p<.01): GSE150874) 

Here, we’ve specified that the species in question is mouse. “both w/glutamine deprivation” means that both the slug1 kd cells and the control cells were subjected to glutamine deprivation. GSE150874 shows that we acquired the data from GEO, but the “raw data” label means that we extracted the data from an attached file, not via the “GEO2R” tool, which may or may not be available for particular studies. We also show that transcripts we term “upregulated” or “downregulated” were required to differ from control transcripts at a significance level of < .01, as measured by a t-test. Having met this criterion, the transcripts are then sorted according to fold-change. If we don’t mention a significance level, the level is .05.

You may wonder why we use a simple t-test here, as opposed to more sophisticated tools such as “limma.” The simple answer: convenience. Also, one of the oft-mentioned weaknesses of t-tests versus other approaches (negative binomial, etc.) may actually be a strength; t-tests will tend to “overvalue” transcripts with few counts versus abundant transcripts. Thus, if you hypothesize that rare transcripts may play important roles in biological outcomes, you won’t be disappointed. 

upregulated in head and neck cancer tissue w/metastasis (vs. without)(fc/p: GSE136037: Tumor Biomarkers for the Prediction of Distant Metastasis) 

There’s one term worth explaining: “fc/p”. This simply means we divided log(fold-change) by P-value, and then ranked the results. This is a simple way to combine the importance of both fold-change and significance. 

Hopefully, our terminology is now more transparent. If you have further questions, feel free to e-mail us. Also, if you don’t agree with the analysis approach we’ve used, you can always check out the underlying data and sort it according to your own methods. You can then re-enter it into our various tools.