import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scikit_posthocs as sp
import seaborn as sns
from plotly import graph_objects as go
from networkx.drawing.nx_agraph import graphviz_layout
from scipy.stats import kruskal, spearmanr
DATA620 - Project 1
Categorical Groups and Centrality in Gene Ontology Data
Data Load
In this first project, I will use SNAP data titled Gene Function Association Network. The dataset contains an association network of over 22k nodes and over 16k edges for gene function. The edges are meant to describe what a gene does or what it is involved in.
genes=pd.read_csv("GF-Miner_miner-gene-function.tsv", sep= "\t")
genes.head()
|   | # GO_ID | C0 | C1 | Gene | C3 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | C16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GO:0005509 | UniProtKB | A0A024QZ42 | PDCD6 | NaN | GO_REF:0000002 | IEA | InterPro:IPR002048 | F | HCG1985580, isoform CRA_c | A0A024QZ42_HUMAN|PDCD6|hCG_1985580 | protein | taxon:9606 | 20160409 | InterPro | NaN | NaN |
| 1 | GO:0004672 | UniProtKB | A0A024QZP7 | CDK1 | NaN | GO_REF:0000002 | IEA | InterPro:IPR000719|InterPro:IPR002290|InterPro... | F | Cyclin-dependent kinase 1 | A0A024QZP7_HUMAN|CDK1 | protein | taxon:9606 | 20160409 | InterPro | NaN | NaN |
| 2 | GO:0005524 | UniProtKB | A0A024QZP7 | CDK1 | NaN | GO_REF:0000038 | IEA | UniProtKB-KW:KW-0067 | F | Cyclin-dependent kinase 1 | A0A024QZP7_HUMAN|CDK1 | protein | taxon:9606 | 20160410 | UniProt | NaN | NaN |
| 3 | GO:0005634 | UniProtKB | A0A024QZP7 | CDK1 | NaN | GO_REF:0000052 | IDA | NaN | C | Cyclin-dependent kinase 1 | A0A024QZP7_HUMAN|CDK1 | protein | taxon:9606 | 20101115 | HPA | NaN | NaN |
| 4 | GO:0005737 | UniProtKB | A0A024QZP7 | CDK1 | NaN | GO_REF:0000052 | IDA | NaN | C | Cyclin-dependent kinase 1 | A0A024QZP7_HUMAN|CDK1 | protein | taxon:9606 | 20101115 | HPA | NaN | NaN |
As interesting as the molecular world is, I’m unfamiliar with its many intricacies, particularly in the field of genomics. The variables in the dataset are labeled alphanumerically, a format that neither explains nor hints at the meaning of the observations, so as it stands the data is not accessible to those unfamiliar with the subject matter. Since no data dictionary is provided with the data, I’ll need to research the datapoints in order to understand the variables, and all of the columns will require renaming for clarity.
Before I begin researching, let’s check how many unique values and NA values each column contains:
#-- organizing overview--
overview= (genes.agg(["nunique", lambda x: x.isna().sum()]).T)
#-- renaming summary columns for clarity--
#-- and the index because I prefer it this way--
overview=overview.reset_index()
overview.columns=["variable", "unique", "NA count"]
overview
|   | variable | unique | NA count |
|---|---|---|---|
| 0 | # GO_ID | 16628 | 0 |
| 1 | C0 | 3 | 0 |
| 2 | C1 | 6676 | 0 |
| 3 | Gene | 5957 | 0 |
| 4 | C3 | 5 | 16407 |
| 5 | C5 | 4742 | 0 |
| 6 | C6 | 13 | 0 |
| 7 | C7 | 6160 | 6106 |
| 8 | C8 | 3 | 0 |
| 9 | C9 | 6006 | 0 |
| 10 | C10 | 6660 | 36 |
| 11 | C11 | 3 | 0 |
| 12 | C12 | 17 | 0 |
| 13 | C13 | 2213 | 0 |
| 14 | C14 | 22 | 0 |
| 15 | C15 | 525 | 15961 |
| 16 | C16 | 76 | 16494 |
There are several columns with many NA values; C3, C7, C10, C15, and C16 are all flagged. In the interest of time, I will likely impute the NA values, but I want to acknowledge their context first. There are also several columns I’m sure I won’t need, so I will remove them now.
My main focus here is human data. Although the column has no missing values, all datapoints in C12 include the human taxon, so I am sure it will not be needed. I won’t need the PubMed identifiers in C13, and I won’t need the Ontology Resource Names in C14.
#-- removing columns not needed--
genes= genes.drop(columns=["C12", "C13", "C14"])
Creating a Data Dictionary
To create this dictionary I sampled random datapoints within each column and researched them until I found consensus between sources. For example, C8 is a category/type. I searched “Gene Ontology F, C, P” and found GO enrichment analysis tools. The site describes how the Gene Ontology tool is used and explains that a “GO Aspect” may be entered to learn a particular gene’s aspect: “molecular function” (F), “cellular component” (C), or “biological process” (P). I searched the database for the corresponding IDs of my randomly selected rows and confirmed that the “Aspect” listed in the database (F, C, or P) matched each row’s value in C8. I renamed the column category_type instead of aspect.
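The sampling step can be sketched like this (a minimal illustration on a hypothetical stand-in dataframe, not the full dataset):

```python
import pandas as pd

# stand-in slice of the raw dataframe (hypothetical values)
genes = pd.DataFrame({
    "C6": ["IEA", "IDA", "IEA", "TAS"],
    "C8": ["F", "C", "P", "F"],
})

# pull a few non-NA values from each column to look up manually
for col in genes.columns:
    vals = genes[col].dropna().sample(2, random_state=1).tolist()
    print(col, vals)
```

Repeating this over a handful of random seeds surfaces enough distinct values per column to search for.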
With the data’s dictionary completed, I’ll modify the column names to be more descriptive:
genes.columns=["go_id", "database", "gene_id", "gene_name", "action", "ref", "evidence_code", "ex_ref", "category_type", "encode_name", "encoder", "func_type", "action_on_entity", "db_ref"]
genes.head()
|   | go_id | database | gene_id | gene_name | action | ref | evidence_code | ex_ref | category_type | encode_name | encoder | func_type | action_on_entity | db_ref |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GO:0005509 | UniProtKB | A0A024QZ42 | PDCD6 | NaN | GO_REF:0000002 | IEA | InterPro:IPR002048 | F | HCG1985580, isoform CRA_c | A0A024QZ42_HUMAN|PDCD6|hCG_1985580 | protein | NaN | NaN |
| 1 | GO:0004672 | UniProtKB | A0A024QZP7 | CDK1 | NaN | GO_REF:0000002 | IEA | InterPro:IPR000719|InterPro:IPR002290|InterPro... | F | Cyclin-dependent kinase 1 | A0A024QZP7_HUMAN|CDK1 | protein | NaN | NaN |
| 2 | GO:0005524 | UniProtKB | A0A024QZP7 | CDK1 | NaN | GO_REF:0000038 | IEA | UniProtKB-KW:KW-0067 | F | Cyclin-dependent kinase 1 | A0A024QZP7_HUMAN|CDK1 | protein | NaN | NaN |
| 3 | GO:0005634 | UniProtKB | A0A024QZP7 | CDK1 | NaN | GO_REF:0000052 | IDA | NaN | C | Cyclin-dependent kinase 1 | A0A024QZP7_HUMAN|CDK1 | protein | NaN | NaN |
| 4 | GO:0005737 | UniProtKB | A0A024QZP7 | CDK1 | NaN | GO_REF:0000052 | IDA | NaN | C | Cyclin-dependent kinase 1 | A0A024QZP7_HUMAN|CDK1 | protein | NaN | NaN |
The NA overview will now be more readable:
overview= (genes.agg(["nunique", lambda x: x.isna().sum()]).T)
overview=overview.reset_index()
overview.columns=["variable", "unique", "NA count"]
overview
|   | variable | unique | NA count |
|---|---|---|---|
| 0 | go_id | 16628 | 0 |
| 1 | database | 3 | 0 |
| 2 | gene_id | 6676 | 0 |
| 3 | gene_name | 5957 | 0 |
| 4 | action | 5 | 16407 |
| 5 | ref | 4742 | 0 |
| 6 | evidence_code | 13 | 0 |
| 7 | ex_ref | 6160 | 6106 |
| 8 | category_type | 3 | 0 |
| 9 | encode_name | 6006 | 0 |
| 10 | encoder | 6660 | 36 |
| 11 | func_type | 3 | 0 |
| 12 | action_on_entity | 525 | 15961 |
| 13 | db_ref | 76 | 16494 |
Creating a Graph, Version 1
The data does specify that the edges are the functional annotations and the nodes are go_id and gene_name. In this project, however, we’ll use category_type and database as our categorical groups because each variable is tidy, with only 3 distinct values. This is still exploratory; as such, the graph will be undirected because we’re exploring mutual connection.
action had far too many NA values (which could very well mean there is no “action” to be had for some genes). It’s best to check the breakdown of these NA values while we’re on the topic.
#--grabbing the points needed--
na_actions= genes[genes["action"].isna()]
cat_total= genes.groupby(["category_type"]).size().reset_index(name="count")
na_counts= na_actions.groupby(["category_type"]).size().reset_index(name="na_count")
#--neat df--
summary_denom= pd.merge(na_counts, cat_total, on="category_type")
summary_denom["na_percent"]= (summary_denom["na_count"]/summary_denom["count"]*100).round(2)
summary_denom
|   | category_type | na_count | count | na_percent |
|---|---|---|---|---|
| 0 | C | 1463 | 1508 | 97.02 |
| 1 | F | 3841 | 3944 | 97.39 |
| 2 | P | 11103 | 11176 | 99.35 |
There doesn’t seem to be much difference between the categories in terms of the NA values within the action variable. Over 90% of values are NA in each, but judging by the sheer volume and attention to detail of genomic data, it’s very likely there is simply no immediate action for the genes marked NA. Its emptiness is not missingness so much as the quality of a gene not being the direct instigator or catalyst of an action.
It might be worth keeping the action column and using a conditional to display the action when one is available. However, that action feature would be directional, and the rest of my data is better served by an undirected graph depicting relationships. For the sake of simplicity and time, I will drop the action column.
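For the record, the conditional display I considered could look like this sketch (a toy action series, with a hypothetical placeholder label):

```python
import numpy as np
import pandas as pd

# toy slice of the action column (hypothetical values)
actions = pd.Series(["enables", np.nan, "involved_in", np.nan])

# show the action when present, otherwise an explicit placeholder
labeled = actions.where(actions.notna(), "no direct action")
print(labeled.tolist())  # ['enables', 'no direct action', 'involved_in', 'no direct action']
```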
genes= genes.drop(columns=["action", "ex_ref", "action_on_entity", "db_ref", "encoder"])
gg1= nx.Graph()
I ran a preliminary graph, pictured below, and learned there is far too much data to depict. A stratified sample of the data will maintain the original balance of the categorical values in genes while remaining readable.
Stratified Sampling
#--stratified sample--
sample= genes.groupby("category_type", group_keys=False).sample(frac= 0.05)
gg1.add_edges_from(sample[["gene_name", "evidence_code"]].values)
gg1.add_edges_from(sample[["evidence_code", "go_id"]].values)
gg1.add_edges_from(sample[["go_id", "category_type"]].values)
After the stratified sample, the data is ready for visualization. The vantage point I’m interested in after examining the data involves evidence_codes, the notes on the evidence behind a gene’s functional annotation. There are 26 unique codes in existence, divided into 6 main code categories:
- experimental evidence
- phylogenetic evidence (inferred through experimental evidence)
- computational evidence
- author statements (statements made by authors in scientific papers)
- curatorial statements (an expert from a database)
- automatically generated annotations (automated, no person reviewed this)
Note: The gene data used here only contains 13 unique codes after NA removal.
What’s important about this?
The evidence codes assigned to the genes are not “equal”: one code may carry more weight than another in a sort of credibility hierarchy. “Non-traceable author statements” naturally have less credibility than findings indicated through an actual experiment. Lowering the opacity on gene annotations whose origin is less credible would add an uncluttered extra layer of information to the network graph.
As a draft, the scores below show how I would rate the evidence codes on an alpha scale. Those under experimental evidence (which I would say is most credible) are graded 1 (EXP, IDA, IPI, IMP, IGI, and IEP). The lowest grades go to IEA (0.3) and to ND and NAS (0.2 each, NAS upon closer inspection).
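As a minimal sketch of how those grades could feed the final render (a toy three-node graph, not the project graph), nx.draw_networkx_nodes accepts a list of per-node alpha values:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for this sketch
import matplotlib.pyplot as plt
import networkx as nx

# toy graph of evidence-code nodes, graded on the draft scale above
g = nx.Graph([("EXP", "IEA"), ("IEA", "NAS")])
credibility = {"EXP": 1.0, "IEA": 0.3, "NAS": 0.2}

pos = nx.spring_layout(g, seed=33)
alphas = [credibility[n] for n in g.nodes()]  # one alpha per node

# less credible codes render more transparent
nx.draw_networkx_nodes(g, pos, alpha=alphas)
nx.draw_networkx_labels(g, pos)
plt.axis("off")
```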
Research Question
With that said, the research question for the rest of this project will be:
“Do certain GO aspects (category types) tend to rely on particular evidence types? If so, which genes appear within those pathways?”
Mapping Categories
I’ll map the definitions for the evidence code acronyms for readability:
#-- The list--
ec_names= {"IEA": "Inferred from Electronic Annotation",
"IDA": "Inferred from Direct Assay",
"ISS": "Inferred from Sequence or Structural Similarity",
"IMP": "Inferred from Mutant Phenotype",
"IPI": "Inferred from Physical Interaction",
"TAS": "Traceable Author Statement",
"ND": "No Biological Data Available",
"IBA": "Inferred from Biological Aspect of Ancestor",
"IEP": "Inferred from Expression Pattern",
"EXP": "Inferred from Experiment",
"NAS": "Non-traceable Author Statement",
"IGI": "Inferred from Genetic Interaction",
"IC": "Inferred by Curator"}
cta_types= {"P": "Process","F": "Function", "C": "Component"}
#-- mapping evidence names--
genes["evidence_names"]= genes["evidence_code"].map(ec_names)
#-- Mapping Aspect/ Category Types--
genes["aspect"]= genes["category_type"].map(cta_types)
Below is a table of evidence codes and how they’re distributed within each category:
#--Starter Scores--
credibility_score= {"EXP": 1, "IDA": 1, "IPI": 1, "IMP": 1, "IGI": 1, "IEP": 1, "ISS": 1, "IBA": 0.5, "IC": 0.5, "TAS": 0.5,"IEA": 0.3, "NAS": 0.2, "ND": 0.2}
genes["cred_score"]= genes["evidence_code"].map(credibility_score)
cred_count= genes.groupby(["aspect", "evidence_names", "cred_score"]).size().reset_index(name="count").sort_values(["aspect", "cred_score"], ascending= [True,False])
#--Pivoting for evidence code counts--
piv_cred_count= cred_count.pivot(index="evidence_names", columns="aspect", values="count").fillna(0)
piv_cred_count= piv_cred_count[["Process", "Function", "Component"]]
piv_cred_count= piv_cred_count.sort_values(by="Process", ascending=False)
piv_cred_count
| aspect | Process | Function | Component |
|---|---|---|---|
| evidence_names | |||
| Inferred from Electronic Annotation | 5072.0 | 1747.0 | 683.0 |
| Inferred from Direct Assay | 1877.0 | 864.0 | 419.0 |
| Inferred from Sequence or Structural Similarity | 1502.0 | 190.0 | 116.0 |
| Inferred from Mutant Phenotype | 1287.0 | 107.0 | 15.0 |
| Traceable Author Statement | 622.0 | 411.0 | 84.0 |
| Inferred from Biological Aspect of Ancestor | 437.0 | 267.0 | 131.0 |
| Non-traceable Author Statement | 162.0 | 76.0 | 30.0 |
| Inferred by Curator | 79.0 | 15.0 | 12.0 |
| Inferred from Genetic Interaction | 70.0 | 8.0 | 1.0 |
| Inferred from Expression Pattern | 62.0 | 0.0 | 0.0 |
| Inferred from Physical Interaction | 3.0 | 174.0 | 16.0 |
| Inferred from Experiment | 2.0 | 84.0 | 0.0 |
| No Biological Data Available | 1.0 | 1.0 | 1.0 |
Evidence Code Composition
I added grades to each evidence code reflecting the credibility hierarchy discussed earlier; they will double as alpha values once the graph is rendered. The next item to address is the evidence code composition within each aspect. I need to normalize the groups to be able to compare them to each other. It’s also important to take into account the volume (count) and value (cred_score) of each scored evidence code in each category.
I’ll create a dataframe to support visualization of my assigned credibility scores. First, I’ll calculate the rates of the evidence codes by group. My initial instinct was to calculate raw proportions, but because the groups are all different sizes, I need to normalize in order to compare the evidence codes within a category.
#-- evidence code rates, by category--
cred_count["ec_rates"]= (cred_count["count"] / cred_count.groupby("aspect")["count"].transform("sum")).round(3)
Next, I’ll flip the sign on the datapoints whose credibility score is less than 0.5:
cred_count.loc[cred_count["cred_score"]<0.5, "ec_rates"]= -cred_count["ec_rates"]
I’ll adjust the dataframe:
cred_plot= pd.pivot_table(cred_count, index= "aspect", columns= "evidence_names", values= "ec_rates")
Composition on Plotly
Now we’ll put the plot together. Evidence codes with a low credibility score appear on the left, while codes with a high credibility score appear on the right.
div_plot= go.Figure()
for col in cred_plot.columns:
div_plot.add_trace(go.Bar(x= cred_plot[col], y=cred_plot.index, orientation= "h", name= col, hovertemplate= "%{y}: %{x:.3f}"))
div_plot.update_layout(template= "plotly_white", barmode= "relative", height=600, width=900, yaxis_autorange="reversed", bargap=0.5, legend_orientation="v", title= "Evidence Code Rates and Credibility by Annotation Category Type (Aspect)", title_x= 0.5, title_font_size=18)
#-- adding annotations for right and left--
div_plot.add_annotation(x= 0.25, y= 1.1, text= "credible → ", xref="x", yref= "paper",showarrow=False, font=dict(size= 12, color= "black"))
div_plot.add_annotation(x= -0.25, y= 1.1, text= "← less credible", xref="x", yref= "paper", showarrow=False, font=dict(size= 12, color= "black"))
div_plot.show()
Centrality of Nodes in a Graph
Degree centrality and eigenvector centrality will help identify which nodes are the most connected and which of those connections matter.
centrality= nx.degree_centrality(gg1)
eig_centrality= nx.eigenvector_centrality(gg1, max_iter=2000)
#-- creating new columns for centrality--
gene_centrality= (sample[["gene_name"]].drop_duplicates())
gene_centrality["degree_centrality"]= gene_centrality["gene_name"].map(centrality)
gene_centrality["eig_centrality"]= gene_centrality["gene_name"].map(eig_centrality)
top_genes= gene_centrality.sort_values("eig_centrality", ascending=False).head(30)
#-- isolating top genes--
top_genes_2= gene_centrality.sort_values("eig_centrality", ascending=False).head(8)
top_sample= sample[sample["gene_name"].isin(top_genes["gene_name"])]
#--creating summary of centrality by nodes--
centrality_summary= pd.DataFrame({
"node": list(centrality.keys()),
"deg_centr": list(centrality.values()),
"eig_centr": list(eig_centrality.values())})
#-- attaching the categories--
cats= genes[["go_id", "category_type"]].drop_duplicates()
cats.columns= ["node", "category_type"]
#--putting them together--
centrality_summary= centrality_summary.merge(cats, on="node", how= "left")
#--labeling the nodes correctly--
centrality_summary["category_type"]= centrality_summary["category_type"].fillna("gene")
ec_names_set= set(ec_names.keys())
centrality_summary["category_type"]= (centrality_summary["node"].map(
lambda x:
"evidence_code" if x in ec_names_set
else "aspect" if x in {"C", "F", "P"}
else "go_term" if x in set(genes["go_id"].dropna())
else "gene"))
Creating a Graph, Version 2
#--resetting graph--
gg1= nx.Graph()
#-- edges--
gg1.add_edges_from(top_sample[["gene_name", "evidence_code"]].values)
gg1.add_edges_from(top_sample[["evidence_code", "go_id"]].values)
gg1.add_edges_from(top_sample[["go_id", "category_type"]].values)
#--node groups--
gene_nodes= set(top_sample["gene_name"])
evidence_nodes= set(top_sample["evidence_code"])
go_nodes= set(top_sample["go_id"])
category_nodes= set(top_sample["category_type"])
I’ll assign a scaling function over the nodes in my graph. In testing I found that despite sampling there were still too many nodes and edges to make the graph readable. Adding visual priority to the nodes with the highest eigenvector values, those most important and most connected to other important nodes, improves readability.
#-- trying to emulate plotly_white's colors--
category_color_map= {"P": "#636EFA", "F": "#00CC96", "C": "#EF553B"}
gene_color= "#AB63FA"
go_color= "#FFA15A"
evidence_code_color= "#B6B6B6"
#-- defining the scale--
base_sizes= {
"category": 800,
"gene": 300,
"evidence_code": 200,
"go_term": 20}
scale= 2500
#-- creating lists--
node_colors= []
node_sizes= []
#-- a loop to assign my specs--
for node in gg1.nodes():
if node in category_nodes:
node_colors.append(category_color_map[node])
base= base_sizes["category"]
elif node in gene_nodes:
node_colors.append(gene_color)
base= base_sizes["gene"]
elif node in evidence_nodes:
node_colors.append(evidence_code_color)
base= base_sizes["evidence_code"]
elif node in go_nodes:
node_colors.append(go_color)
base= base_sizes["go_term"]
else:
node_colors.append("#E5E5E5")
base= 40
eig= eig_centrality.get(node, 0)
node_sizes.append(base + eig * scale)
#--spacing fix for graph--
#pos_init= {
## "P": np.array([0, 0]),
# "F": np.array([30, 20]),
# "C": np.array([25, -25])}
#pos= nx.spring_layout(gg1,pos=pos_init, fixed= list(pos_init.keys()), seed=33, k=8, iterations=2000)
Next I encountered issues with overlapping nodes. Cutting the data down again was not something I could accept. Researching the issue, I found it was possible to choose anchor points on the graph in the form of coordinate arrays. I selected the aspect nodes as anchors, but the graph still did not appear in a satisfactory way.
The closest thing to a solution came from graphviz_layout in NetworkX’s nx_agraph module.2 This layout can be tuned a little more easily.
pos= graphviz_layout(gg1, prog="sfdp", args="-Goverlap=false -Gsep=+5")
#--not labeling go nodes--
labels= {n: n for n in gg1.nodes() if n in gene_nodes| category_nodes| evidence_nodes}
#-- drawing the graph--
plt.figure(figsize=(20,20), dpi=300)
#--changing background--
plt.gca().set_facecolor("white")
plt.gcf().set_facecolor("white")
nx.draw_networkx_edges(gg1, pos, alpha=0.15, width=0.4, edge_color="#1C1C1C")
nx.draw_networkx_nodes(gg1, pos, node_color=node_colors, node_size=node_sizes)
nx.draw_networkx_labels(gg1, pos, labels=labels, font_size=8, font_color="#2a2a2a")
plt.title("Gene Ontology Annotations:\nNetworks Between GO IDs, Evidence Codes, and Associated Genes", fontsize=12, fontweight="normal", color="#1C1C1C")
plt.axis("off")
plt.show()
Graph Descriptive Stats
print("Nodes:", gg1.number_of_nodes())
print("Annotation relationships (edges):", gg1.number_of_edges())
print("Density:", nx.density(gg1))
print("Average Connections (degree):", sum(x for n, x in gg1.degree()) / gg1.number_of_nodes())
Nodes: 112
Annotation relationships (edges): 207
Density: 0.033301158301158304
Average Connections (degree): 3.6964285714285716
Comparing Centrality Across Categorical Groups
The categories assigned to the genes describe what each gene participates in. As we’ve seen, genes in this dataset predominantly participate in biological processes, and less often appear under a singular component or function.
centrality_summary.groupby("category_type")[["deg_centr", "eig_centr"]].describe()
| | deg_centr | | | | | | | | eig_centr | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max |
| category_type | ||||||||||||||||
| aspect | 3.0 | 0.174763 | 1.588152e-01 | 0.047319 | 0.085804 | 0.124290 | 0.238486 | 0.352681 | 3.0 | 0.172731 | 0.220775 | 0.023699 | 0.045913 | 0.068127 | 0.247247 | 0.426366 |
| evidence_code | 12.0 | 0.085174 | 1.293122e-01 | 0.003785 | 0.006940 | 0.033123 | 0.091009 | 0.447319 | 12.0 | 0.064117 | 0.154816 | 0.000250 | 0.001863 | 0.009945 | 0.041232 | 0.549412 |
| gene | 740.0 | 0.000673 | 1.669654e-04 | 0.000631 | 0.000631 | 0.000631 | 0.000631 | 0.001893 | 740.0 | 0.009329 | 0.008321 | 0.000008 | 0.001606 | 0.002614 | 0.018095 | 0.021985 |
| go_term | 831.0 | 0.001262 | 4.339420e-19 | 0.001262 | 0.001262 | 0.001262 | 0.001262 | 0.001262 | 831.0 | 0.018935 | 0.010209 | 0.000789 | 0.014892 | 0.016657 | 0.032138 | 0.032138 |
Aspect has the highest number of connections, which makes sense: all gene annotations are labeled with an aspect, so everything connects to an aspect node. The same idea goes for evidence codes. Gene is the least connected, with a mean degree centrality of 0.000673; this also makes sense because a gene’s functional annotations are specific.
fig, axes= plt.subplots(1, 2, figsize=(12, 5))
palette= ["#9F99E8", "#5DCAA5", "#F0997B", "#F4C0D1"]
#---Degree Cent---
sns.boxplot(data=centrality_summary,
x="category_type",
y="deg_centr",
palette=palette,
ax=axes[0])
axes[0].set(title="Degree Centrality \nMany neighbors?", xlabel="", ylabel="")
axes[0].title.set_fontsize(14)
#--- Eigenvector Centr---
sns.boxplot(
data=centrality_summary,
x="category_type",
y="eig_centr",
palette=palette,
ax=axes[1])
axes[1].set_yscale("log")
axes[1].set(title="Eigenvector Centrality (logged) \nMany important neighbors?", xlabel="", ylabel="")
axes[1].title.set_fontsize(14)
plt.tight_layout()
plt.show()
The boxplots are also a great visual for the connections in this data. Aspect and evidence code are the hubs; they are really the only visible boxes in the degree centrality plot because they are what the other nodes connect to. One evidence code is an outlier floating to the top of the plot, far more connected than any other evidence code. Our final graph depicted this: IEA was a large node in the center.
The eigenvector centrality plot works in the same manner: the boxes floating to the top are the most important. I logged the axis to see the differences between groups more clearly. Aspect is the most important and the most evenly spread; it has many connections, and many of those connections are themselves important.
What we’ve seen answers the question: do category types rely on particular evidence types?
Yes, we see that the categories (aspects) rely on IEA (Inferred from Electronic Annotation). However, the research question was more specific.
“Do certain GO aspects (category types) tend to rely on particular evidence types? If so, which genes appear within those pathways?”
We’ll need to look at pairwise relations to answer this. The Kruskal-Wallis test compares ranks shared across the categories, and because our data is not evenly distributed, this nonparametric test is best.
Statistical Test on Categorical Data
Kruskal Wallis Test
“Are there differences between groups?”
#--grouping deg centr--
groups_dc= [group["deg_centr"].values
for name, group in centrality_summary.groupby("category_type")]
#-- grouping eig--
groups_ec= [group["eig_centr"].values
for name, group in centrality_summary.groupby("category_type")]
#-- Kruskal--
k_dc= kruskal(*groups_dc)
k_ec= kruskal(*groups_ec)
#-- results--
print("Kruskal Wallis Results")
print(f"(deg_centr): H= {k_dc.statistic:.3f}, p-val= {k_dc.pvalue:.3e}")
print(f"(eig_centr): H= {k_ec.statistic:.3f}, p-val= {k_ec.pvalue:.3e}")
Kruskal Wallis Results
(deg_centr): H= 1400.208, p-val= 2.654e-303
(eig_centr): H= 254.699, p-val= 6.302e-55
The resulting p-values of the Kruskal-Wallis test are incredibly low, so the result is considered significant: there is absolutely a difference between categories. Since the result was significant, I’ll run a Dunn post-hoc test to calculate pairwise ranks across groups. It’s considered the correct post-hoc for Kruskal-Wallis.3
Post-Hoc: Dunn
We’ve established that the IEA evidence code is an extreme outlier, which skews the means, so we’ll pull the medians from the centrality dataframe to compare with the Dunn results. The Dunn post-hoc calculates the pairwise ranks, which adds context to the Kruskal-Wallis result.
#--Grabbing Medians--
print(centrality_summary.groupby("category_type")[["deg_centr", "eig_centr"]].median().round(3))
               deg_centr  eig_centr
category_type
aspect 0.124 0.068
evidence_code 0.033 0.010
gene 0.001 0.003
go_term 0.001 0.017
# --- Dunn's post-hoc ---
dunn_deg= sp.posthoc_dunn(centrality_summary, val_col="deg_centr", group_col="category_type", p_adjust="fdr_bh")
dunn_eig= sp.posthoc_dunn(centrality_summary, val_col="eig_centr", group_col="category_type", p_adjust="fdr_bh")
print("Degree centrality:\n", dunn_deg)
print("\nEigenvector centrality:\n", dunn_eig)
Degree centrality:
aspect evidence_code gene go_term
aspect 1.000000e+00 9.883047e-01 4.837137e-07 5.925995e-02
evidence_code 9.883047e-01 1.000000e+00 4.219411e-24 1.586584e-04
gene 4.837137e-07 4.219411e-24 1.000000e+00 9.588917e-293
go_term 5.925995e-02 1.586584e-04 9.588917e-293 1.000000e+00
Eigenvector centrality:
aspect evidence_code gene go_term
aspect 1.000000 0.042090 1.980449e-03 6.328206e-02
evidence_code 0.042090 1.000000 1.180762e-01 2.798842e-01
gene 0.001980 0.118076 1.000000e+00 5.549853e-55
go_term 0.063282 0.279884 5.549853e-55 1.000000e+00
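Reading pairs off these matrices by eye is error-prone, so a small helper can list the pairs below 0.05. The sketch below runs on a toy symmetric p-value matrix shaped like the Dunn output above (the values are hypothetical):

```python
import pandas as pd

def significant_pairs(pmat, alpha=0.05):
    """List (group_a, group_b) pairs whose p-value is below alpha."""
    cols = list(pmat.columns)
    return [(a, b)
            for i, a in enumerate(cols)
            for b in cols[i + 1:]          # upper triangle only, skip diagonal
            if pmat.loc[a, b] < alpha]

# toy matrix, same shape as the Dunn tables above (hypothetical values)
toy = pd.DataFrame(
    [[1.00, 0.98, 1e-6],
     [0.98, 1.00, 0.04],
     [1e-6, 0.04, 1.00]],
    index=["aspect", "evidence_code", "gene"],
    columns=["aspect", "evidence_code", "gene"])

print(significant_pairs(toy))  # [('aspect', 'gene'), ('evidence_code', 'gene')]
```

The same call on dunn_deg and dunn_eig would enumerate the significant pairs directly.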
For degree centrality, Dunn’s output identifies gene vs. aspect, gene vs. evidence_code, gene vs. go_term, and evidence_code vs. go_term as having significant differences between pairs.
For eigenvector centrality, the significant pairs are gene vs. go_term, gene vs. aspect, and aspect vs. evidence_code (p = 0.042); gene vs. evidence_code (p = 0.118) does not reach significance.
So, the research question is answered. In degree centrality, no statistically significant difference appears between aspect and evidence_code, though eigenvector centrality shows a marginal one. Gene nodes appear in nearly all of the pair comparisons flagged as significant in Dunn’s output.
Spearman Rank Relations
cred_score_map = dict(zip(genes["gene_name"], genes["cred_score"]))
centrality_summary["cred_score"] = centrality_summary["node"].map(cred_score_map)
# drop missing credibility scores first
s_r= centrality_summary.dropna(subset=["cred_score"])
corr_deg, p_deg= spearmanr(s_r["cred_score"], s_r["deg_centr"])
corr_eig, p_eig= spearmanr(s_r["cred_score"], s_r["eig_centr"])
print("Spearman (deg):", corr_deg, "p-val(deg):", p_deg)
print("Spearman (eig):", corr_eig, "p-val(eig):", p_eig)
Spearman (deg): 0.08197957164982358 p-val(deg): 0.025742785443654166
Spearman (eig): -0.39839888156367137 p-val(eig): 1.4693999109911575e-29
The Spearman rank test adds one last layer to this analysis. Its results show a weak but positive relationship between credibility score and degree centrality, and a moderate negative relationship between credibility score and eigenvector centrality. So, however slight, higher degree centrality is associated with higher credibility scores; and, in a stronger relationship, lower eigenvector centrality is associated with higher credibility.
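As a quick sanity check on the sign convention used here, spearmanr returns exactly -1 for a perfectly decreasing pair:

```python
from scipy.stats import spearmanr

# strictly decreasing y as x rises -> rho of exactly -1
rho, p = spearmanr([1, 2, 3, 4], [10, 8, 5, 1])
print(rho)  # -1.0
```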
In all, our exploration of the gene ontology data tells us that the network is structured around aspects and evidence codes, that influence and importance do not equal credibility, and that gene nodes sit at the periphery as the least connected members of the network.
References
Matplotlib List of Named colors:
https://matplotlib.org/stable/gallery/color/named_colors.html
Plotly Diverging Bar Chart:
https://www.geeksforgeeks.org/python/diverging-bar-chart-using-python/
Customizing Plotly:
https://plotly.com/python/templates/
Fixing spacing issue with Graphviz_layout()
https://networkx.org/documentation/stable/_modules/networkx/drawing/nx_agraph.html#graphviz_layout
NetworkX Crash Course: Graph Theory in Python:
https://youtu.be/VetBkjcm9Go?si=KtmFHoyRwoprbuub
Introduction to Network Analysis in Python
Footnotes
Graphviz AGraph Documentation: https://networkx.org/documentation/stable/_modules/networkx/drawing/nx_agraph.html#graphviz_layout↩︎
A PostHoc Test for Kruskal Wallis: https://www.theanalysisfactor.com/dunns-test-post-hoc-test-after-kruskal-wallis/↩︎