DATA620-Project 1

Author

Ana Collado

Categorical Groups and Centrality in Gene Ontology Data

For your first project, you are asked to:

  1. Identify and load a network dataset that has some categorical information available for each node.

  2. For each of the nodes in the dataset, calculate degree centrality and eigenvector centrality.

  3. Compare your centrality measures across your categorical groups.

Data Load

import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scikit_posthocs as sp
import seaborn as sns
from plotly import graph_objects as go
from networkx.drawing.nx_agraph import graphviz_layout
from scipy.stats import kruskal, spearmanr

In this first project, I will use SNAP data titled Gene Function Association Network. The dataset contains an association network of over 22k nodes and over 16k edges for gene function. The edges are meant to describe what a gene does or is involved in.

genes=pd.read_csv("GF-Miner_miner-gene-function.tsv", sep= "\t")
genes.head()
# GO_ID C0 C1 Gene C3 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16
0 GO:0005509 UniProtKB A0A024QZ42 PDCD6 NaN GO_REF:0000002 IEA InterPro:IPR002048 F HCG1985580, isoform CRA_c A0A024QZ42_HUMAN|PDCD6|hCG_1985580 protein taxon:9606 20160409 InterPro NaN NaN
1 GO:0004672 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000002 IEA InterPro:IPR000719|InterPro:IPR002290|InterPro... F Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein taxon:9606 20160409 InterPro NaN NaN
2 GO:0005524 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000038 IEA UniProtKB-KW:KW-0067 F Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein taxon:9606 20160410 UniProt NaN NaN
3 GO:0005634 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000052 IDA NaN C Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein taxon:9606 20101115 HPA NaN NaN
4 GO:0005737 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000052 IDA NaN C Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein taxon:9606 20101115 HPA NaN NaN

As interesting as the molecular world is, I’m unfamiliar with its many intricacies, particularly in the field of genomics. The variables in the dataset are labeled alphanumerically, a format that doesn’t explain or hint at the meaning of the observations, so as it stands the data isn’t accessible to those unfamiliar with the subject matter. Since no data dictionary is provided with the data, I’ll need to research the datapoints in order to understand the variables, and all of the columns will require renaming for clarity.

Before I begin researching, let’s check how many unique values and NA values each column contains:

#-- organizing overview--
overview= (genes.agg(["nunique", lambda x: x.isna().sum()]).T)
#-- renaming summary columns for clarity--
#-- and the index because I prefer it this way--
overview=overview.reset_index()
overview.columns=["variable", "unique", "NA count"]
overview
variable unique NA count
0 # GO_ID 16628 0
1 C0 3 0
2 C1 6676 0
3 Gene 5957 0
4 C3 5 16407
5 C5 4742 0
6 C6 13 0
7 C7 6160 6106
8 C8 3 0
9 C9 6006 0
10 C10 6660 36
11 C11 3 0
12 C12 17 0
13 C13 2213 0
14 C14 22 0
15 C15 525 15961
16 C16 76 16494

There are several columns with many NA values; C3, C7, C10, C15, and C16 are all flagged. In the interest of time, I will likely impute NA values, but I do want to acknowledge the context of the NA values first. There are several columns I’m sure I won’t need, so I will remove them now.

My main focus here is human data. Although column C12 has no missing values, every datapoint in it refers to humans, so I am sure it will not be needed. I also won’t need the PubMed identifiers in C13 or the Ontology Resource names in C14.

#-- removing columns not needed--
genes= genes.drop(columns=["C12", "C13", "C14"])

Creating a Data Dictionary

  • GO_ID : A unique identifier from the standardized Gene Ontology framework used in biology 1
    rename to go_id

  • C0: Database
    rename to database

  • C1: Gene ID- Genes receive a unique ID for database records. I’ll save this column for labeling. The ID mappings can be retrieved using https://www.uniprot.org/id-mapping
    rename to gene_id

  • Gene : Named Genes

    rename to gene_name

  • C3 : Action upon gene (Colocation, Contribution, ...)
    rename to action

  • C5: Reference
    rename to ref

  • C6: Evidence code- a note on how a gene’s function was established; 13 unique values:
    https://geneontology.org/docs/guide-go-evidence-codes/
    experimental evidence, phylogenetic evidence (inferred through experimental evidence), computational evidence, author statements (statements made by authors in scientific papers), curatorial statements (an expert from a database), automatically generated annotations (automated, no person reviewed this)
    rename to evidence_code
    maybe I can add alpha values to depict how credible the gene assignments are?

  • C7: external reference
    rename to ex_ref

  • C8: Aspect/Category_type- Process, Function, or Component; the type of role the gene plays
    rename to category_type

  • C9: Encoder_name
    rename to encode_name

  • C10 : Encoder
    rename to encoder

  • C11 : Function type
    rename to func_type

  • C12 : Taxonomy. All datapoints include “taxon:9606”, which is Homo sapiens

  • C13 : PMID- not needed

  • C14: Ontology Resources- not needed

  • C15: Action upon entity (entity can be searched)
    rename to action_on_entity

  • C16 : Database reference
    rename to db_ref

To create this dictionary I sampled random datapoints within each column and researched them until I found consensus between sources. For example, C8 is a category/type. I searched “Gene Ontology F, C, P” and found GO Enrichment Analysis tools. The site describes how the gene ontology tool is used and explains that a “GO Aspect” may be entered to learn a particular gene’s aspect: “molecular function”, “cellular component”, or “biological process” (F, C, P). I searched the database for the corresponding IDs of my randomly selected rows and confirmed that the “Aspect” listed in the database (F, C, or P) matched each row’s value in C8. I renamed the column category_type instead of aspect.


Example search, “aspect” is listed as “molecular function” and it is coded as “F” in column C8

With the data’s dictionary completed, I’ll modify the column names to be more descriptive:

genes.columns=["go_id", "database", "gene_id", "gene_name", "action", "ref", "evidence_code", "ex_ref", "category_type", "encode_name", "encoder", "func_type", "action_on_entity", "db_ref"]
genes.head()
go_id database gene_id gene_name action ref evidence_code ex_ref category_type encode_name encoder func_type action_on_entity db_ref
0 GO:0005509 UniProtKB A0A024QZ42 PDCD6 NaN GO_REF:0000002 IEA InterPro:IPR002048 F HCG1985580, isoform CRA_c A0A024QZ42_HUMAN|PDCD6|hCG_1985580 protein NaN NaN
1 GO:0004672 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000002 IEA InterPro:IPR000719|InterPro:IPR002290|InterPro... F Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein NaN NaN
2 GO:0005524 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000038 IEA UniProtKB-KW:KW-0067 F Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein NaN NaN
3 GO:0005634 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000052 IDA NaN C Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein NaN NaN
4 GO:0005737 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000052 IDA NaN C Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein NaN NaN

The NA overview will now be more readable:

overview= (genes.agg(["nunique", lambda x: x.isna().sum()]).T)

overview=overview.reset_index()
overview.columns=["variable", "unique", "NA count"]
overview
variable unique NA count
0 go_id 16628 0
1 database 3 0
2 gene_id 6676 0
3 gene_name 5957 0
4 action 5 16407
5 ref 4742 0
6 evidence_code 13 0
7 ex_ref 6160 6106
8 category_type 3 0
9 encode_name 6006 0
10 encoder 6660 36
11 func_type 3 0
12 action_on_entity 525 15961
13 db_ref 76 16494

Creating a Graph, Version 1

The data does specify that the edges are the functional annotations and the nodes are the go_id and gene_name values. In this project, however, we’ll use category_type and database as our categorical groups, because each is a tidy variable with only 3 distinct values. This is still exploratory; as such, the graph will be undirected because we’re exploring mutual connection.
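In networkx terms, the undirected choice means an edge’s endpoint order carries no meaning. A minimal sketch with hypothetical node names (not from the dataset):

```python
import networkx as nx

g = nx.Graph()                      # undirected: (u, v) and (v, u) are the same edge
g.add_edge("GENE_X", "GO:0000001")  # hypothetical gene/GO pair
g.add_edge("GO:0000001", "GENE_X")  # re-adding in reverse order is a no-op
print(g.number_of_edges())          # 1 -- the connection is mutual by construction
```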

action had far too many NA values (which could very well mean there is no “action” for some genes). It’s best to check the distribution of these NA values while we’re on the topic.

#--grabbing the points needed--
na_actions= genes[genes["action"].isna()]
cat_total= genes.groupby(["category_type"]).size().reset_index(name="count")
na_counts= na_actions.groupby(["category_type"]).size().reset_index(name="na_count")

#--neat df--
summary_denom= pd.merge(na_counts, cat_total, on="category_type")
summary_denom["na_percent"]= (summary_denom["na_count"]/summary_denom["count"]*100).round(2)

summary_denom
category_type na_count count na_percent
0 C 1463 1508 97.02
1 F 3841 3944 97.39
2 P 11103 11176 99.35

There doesn’t seem to be much difference between the categories in terms of the NA values within the action variable. Over 90% of values are NA, but judging by the sheer volume and attention to detail of genomic data, it’s very likely there simply is no immediate action for the genes marked NA. The emptiness doesn’t mean the value is missing; rather, it reflects the quality of a gene not being the direct instigator or catalyst of an action.

It might be worth keeping the action column and using a conditional to display an action when one is available. However, that action feature would be directional, and the rest of my data is better served by an undirected graph depicting relationships. For the sake of simplicity and time, I will drop the action column.

genes= genes.drop(columns=["action", "ex_ref", "action_on_entity", "db_ref", "encoder"])
gg1= nx.Graph()

I ran a preliminary graph, pictured below, and learned there is far too much data to depict. A stratified sample of the data will maintain the original balance of the categorical values in genes while remaining readable.

Stage 1: Screenshot of Full Sample Graph too Dense to Read

Stratified Sampling

#--stratified sample--
sample= genes.groupby("category_type", group_keys=False).sample(frac= 0.05)

gg1.add_edges_from(sample[["gene_name", "evidence_code"]].values)
gg1.add_edges_from(sample[["evidence_code", "go_id"]].values)
gg1.add_edges_from(sample[["go_id", "category_type"]].values)
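As a quick sanity check that groupby().sample() preserves category balance, here’s a sketch on a toy frame (the imbalance pattern mirrors the aspect counts above, but the data is made up):

```python
import pandas as pd

# toy frame: mostly "P", fewer "F" and "C", like the real aspect distribution
toy = pd.DataFrame({"category_type": ["P"] * 100 + ["F"] * 40 + ["C"] * 20})

# stratified 50% sample: each category keeps its original share
strat = toy.groupby("category_type", group_keys=False).sample(frac=0.5, random_state=1)

before = toy["category_type"].value_counts(normalize=True).round(2)
after = strat["category_type"].value_counts(normalize=True).round(2)
print(before.equals(after))  # True -- proportions survive the sampling
```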

After a stratified sample, the data is ready for visualization. The vantage point I’m interested in after examining the data involves evidence_codes, the notes on the function of a gene. There are 26 unique codes in existence, divided into 6 main code categories:

  1. experimental evidence
  2. phylogenetic evidence (inferred through experimental evidence)
  3. computational evidence
  4. author statements (statements made by authors in scientific papers)
  5. curatorial statements (an expert from a database)
  6. automatically generated annotations (automated, no person reviewed this)

Note: The gene data used here only contains 13 unique codes after NA removal.

What’s important about this?

The evidence codes assigned to the genes are not “equal”: one code may carry more weight than another in a sort of credibility hierarchy. “Non-traceable author statements” naturally have less credibility than annotations backed by an actual experiment. Lowering the opacity on gene annotations whose origin is less credible would add an uncluttered extra layer of information to the network graph.

As a draft, the visual below shows how I would rate the evidence codes on an alpha scale. Those under experimental evidence (which I would say is most credible) will be graded as 1 (EXP, IDA, IPI, IMP, IGI, and IEP). The lowest values will be 0.3 for IEA and 0.2 for ND (and NAS, upon closer inspection).
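A minimal sketch of how those grades could become per-node opacity at draw time (the grades follow the draft above; the tiny graph and node names are hypothetical):

```python
import matplotlib.pyplot as plt
import networkx as nx

# draft grades: experimental codes opaque, automated/unreviewed codes faded
cred_alpha = {"EXP": 1.0, "IDA": 1.0, "TAS": 0.5, "IEA": 0.3, "NAS": 0.2, "ND": 0.2}

g = nx.Graph()
g.add_edges_from([("GENE_A", "EXP"), ("GENE_A", "IEA"), ("GENE_B", "NAS")])

# one alpha per node: non-evidence nodes default to fully opaque
alphas = [cred_alpha.get(n, 1.0) for n in g.nodes()]
pos = nx.spring_layout(g, seed=7)
nx.draw_networkx_nodes(g, pos, alpha=alphas)  # alpha accepts a per-node sequence
nx.draw_networkx_edges(g, pos)
plt.axis("off")
```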

Research Question

With that said, the research question for the rest of this project will be:

“Do certain GO aspects (category types) tend to rely on particular evidence types? If so, which genes appear within those pathways?”

Mapping Categories

I’ll map the definitions for the evidence code acronyms for readability:

#-- The list-- 
ec_names= {"IEA": "Inferred from Electronic Annotation",
"IDA": "Inferred from Direct Assay",
"ISS": "Inferred from Sequence or Structural Similarity",
"IMP": "Inferred from Mutant Phenotype",
"IPI": "Inferred from Physical Interaction",
"TAS": "Traceable Author Statement",
"ND": "No Biological Data Available",
"IBA": "Inferred from Biological Aspect of Ancestor",
"IEP": "Inferred from Expression Pattern",
"EXP": "Inferred from Experiment",
"NAS": "Non-traceable Author Statement",
"IGI": "Inferred from Genetic Interaction",
"IC": "Inferred by Curator"}

cta_types= {"P": "Process","F": "Function", "C": "Component"}
#-- mapping evidence names--
genes["evidence_names"]= genes["evidence_code"].map(ec_names)

#-- Mapping Aspect/ Category Types--
genes["aspect"]= genes["category_type"].map(cta_types)

Below is a table of evidence codes and how they’re distributed within each category:

#--Starter Scores--
credibility_score= {"EXP": 1, "IDA": 1, "IPI": 1, "IMP": 1, "IGI": 1, "IEP": 1, "ISS": 1, "IBA": 0.5, "IC": 0.5, "TAS": 0.5,"IEA": 0.3, "NAS": 0.2, "ND": 0.2}

genes["cred_score"]= genes["evidence_code"].map(credibility_score)

cred_count= genes.groupby(["aspect", "evidence_names", "cred_score"]).size().reset_index(name="count").sort_values(["aspect", "cred_score"], ascending= [True,False])
#--Pivoting for evidence code counts--
piv_cred_count= cred_count.pivot(index="evidence_names", columns="aspect", values="count").fillna(0)

piv_cred_count= piv_cred_count[["Process", "Function", "Component"]]
piv_cred_count= piv_cred_count.sort_values(by="Process", ascending=False)

piv_cred_count
aspect Process Function Component
evidence_names
Inferred from Electronic Annotation 5072.0 1747.0 683.0
Inferred from Direct Assay 1877.0 864.0 419.0
Inferred from Sequence or Structural Similarity 1502.0 190.0 116.0
Inferred from Mutant Phenotype 1287.0 107.0 15.0
Traceable Author Statement 622.0 411.0 84.0
Inferred from Biological Aspect of Ancestor 437.0 267.0 131.0
Non-traceable Author Statement 162.0 76.0 30.0
Inferred by Curator 79.0 15.0 12.0
Inferred from Genetic Interaction 70.0 8.0 1.0
Inferred from Expression Pattern 62.0 0.0 0.0
Inferred from Physical Interaction 3.0 174.0 16.0
Inferred from Experiment 2.0 84.0 0.0
No Biological Data Available 1.0 1.0 1.0

Evidence Code Composition

I added grades to each evidence code reflecting the credibility hierarchy discussed earlier; they will double as alpha values once the graph is rendered. The next item to address is the evidence code composition within each aspect. I need to normalize the groups to be able to compare them to each other. It’s also important to take into account the volume (count) and value (cred_score) of each scored evidence code in each category.

I’ll create a dataframe that will support visualization of my assigned credibility scores. First, I’ll calculate the rates of the evidence codes by group. My initial instinct was to calculate raw proportions, but because the groups are all different sizes I need to normalize in order to compare evidence codes within a category.

#-- evidence code rates, by category--
cred_count["ec_rates"]= (cred_count["count"] / cred_count.groupby("aspect")["count"].transform("sum")).round(3)

Next, I’ll negate the rates for datapoints whose credibility score is less than 0.5:

cred_count.loc[cred_count["cred_score"]<0.5, "ec_rates"]= -cred_count["ec_rates"]

I’ll adjust the dataframe:

cred_plot= pd.pivot_table(cred_count, index= "aspect", columns= "evidence_names", values= "ec_rates")

Composition on Plotly

Now we’ll put the plot together. Evidence codes with a low credibility score appear on the left, while codes with a high credibility score appear on the right.

div_plot= go.Figure()

for col in cred_plot.columns:
  div_plot.add_trace(go.Bar(x= cred_plot[col], y=cred_plot.index, orientation= "h", name= col, hovertemplate= "%{y}: %{x:.3f}"))

div_plot.update_layout(template= "plotly_white", barmode= "relative", height=600, width=900, yaxis_autorange="reversed", bargap=0.5, legend_orientation="v", title= "Evidence Code Rates and Credibility by Annotation Category Type (Aspect)", title_x= 0.5, title_font_size=18)

#-- adding annotations for right and left-- 
div_plot.add_annotation(x= 0.25, y= 1.1, text= "credible → ", xref="x", yref= "paper",showarrow=False, font=dict(size= 12, color= "black"))

div_plot.add_annotation(x= -0.25, y= 1.1, text= "← less credible", xref="x", yref= "paper", showarrow=False, font=dict(size= 12, color= "black"))

div_plot.show()

Centrality of Nodes in a Graph

Degree centrality and eigenvector centrality will help identify which nodes are the most connected and which of those connections matter.

centrality= nx.degree_centrality(gg1)
eig_centrality= nx.eigenvector_centrality(gg1, max_iter=2000)

#-- creating new columns for centrality-- 
gene_centrality= (sample[["gene_name"]].drop_duplicates())
gene_centrality["degree_centrality"]= gene_centrality["gene_name"].map(centrality)
gene_centrality["eig_centrality"]= gene_centrality["gene_name"].map(eig_centrality)


top_genes= gene_centrality.sort_values("eig_centrality", ascending=False).head(30)

#-- isolating top genes--
top_genes_2= gene_centrality.sort_values("eig_centrality", ascending=False).head(8)
top_sample= sample[sample["gene_name"].isin(top_genes["gene_name"])]
#--creating summary of centrality by nodes--
centrality_summary= pd.DataFrame({
  "node": list(centrality.keys()),
  "deg_centr": list(centrality.values()),
  "eig_centr": list(eig_centrality.values())})
  
#-- attaching the categories--
cats= genes[["go_id", "category_type"]].drop_duplicates()
cats.columns= ["node", "category_type"]

#--putting them together--
centrality_summary= centrality_summary.merge(cats, on="node", how= "left")

#--labeling the nodes correctly--
centrality_summary["category_type"]= centrality_summary["category_type"].fillna("gene")

ec_names_set= set(ec_names.keys())

centrality_summary["category_type"]= (centrality_summary["node"].map(
  lambda x: 
        "evidence_code" if x in ec_names_set
        else "aspect"    if x in {"C", "F", "P"}
        else "go_term"   if x in set(genes["go_id"].dropna())
        else "gene"))

Creating a Graph, Version 2

#--resetting graph--
gg1= nx.Graph()

#-- edges--
gg1.add_edges_from(top_sample[["gene_name", "evidence_code"]].values)
gg1.add_edges_from(top_sample[["evidence_code", "go_id"]].values)
gg1.add_edges_from(top_sample[["go_id", "category_type"]].values)

#--node groups--
gene_nodes= set(top_sample["gene_name"])
evidence_nodes= set(top_sample["evidence_code"])
go_nodes= set(top_sample["go_id"])
category_nodes= set(top_sample["category_type"])

I’ll assign a scaling function over the nodes in my graph. In testing I found that, despite sampling, there are still too many nodes and edges in the graph to make it readable. Adding visual priority to the nodes with the highest eigenvector values, those most important and most connected to other important nodes, adds to the readability of the graph.

#-- trying to emulate plotly_white's colors--
category_color_map= {"P": "#636EFA", "F": "#00CC96", "C": "#EF553B"}
gene_color= "#AB63FA"
go_color= "#FFA15A"
evidence_code_color= "#B6B6B6"

#-- defining the scale--
base_sizes= {
    "category": 800,
    "gene": 300,
    "evidence_code": 200,
    "go_term": 20}
scale= 2500

#-- creating lists--
node_colors= []
node_sizes= []

#-- a loop to assign my specs--
for node in gg1.nodes():
    if node in category_nodes:
      node_colors.append(category_color_map[node])
      base= base_sizes["category"]
    elif node in gene_nodes:
        node_colors.append(gene_color)
        base= base_sizes["gene"]
    elif node in evidence_nodes:
        node_colors.append(evidence_code_color)
        base= base_sizes["evidence_code"]
    elif node in go_nodes:
        node_colors.append(go_color)
        base= base_sizes["go_term"]
    else:
        node_colors.append("#E5E5E5")
        base= 40
    eig= eig_centrality.get(node, 0)
    node_sizes.append(base + eig * scale)


#--spacing fix for graph (earlier attempt, kept for reference)--
#pos_init= {
#    "P": np.array([0, 0]),
#    "F": np.array([30, 20]),
#    "C": np.array([25, -25])}

#pos= nx.spring_layout(gg1, pos=pos_init, fixed=list(pos_init.keys()), seed=33, k=8, iterations=2000)


Next I encountered issues with overlapping nodes. Cutting down the data again was not something I could accept. Upon researching the issue, it was possible to choose anchor points on the graph in the form of arrays. I selected the aspect nodes as anchors but the graph was still not appearing in a satisfactory way.

The closest thing to a solution came from “GraphViz AGraph”, a module for NetworkX.2 The module can be adjusted a little more easily.

pos= graphviz_layout(gg1, prog="sfdp", args="-Goverlap=false -Gsep=+5")
#--not labeling go nodes--
labels= {n: n for n in gg1.nodes() if n in gene_nodes| category_nodes| evidence_nodes}

#-- drawing the graph--
plt.figure(figsize=(20,20), dpi=300)
#--changing background--
plt.gca().set_facecolor("white")
plt.gcf().set_facecolor("white")

nx.draw_networkx_edges(gg1, pos, alpha=0.15, width=0.4, edge_color="#1C1C1C")
nx.draw_networkx_nodes(gg1, pos, node_color=node_colors, node_size=node_sizes)
nx.draw_networkx_labels(gg1, pos, labels=labels, font_size=8, font_color="#2a2a2a")

plt.title("Gene Ontology Annotations:\nNetworks Between GO IDs, Evidence Codes, and Associated Genes", fontsize=12, fontweight="normal", color="#1C1C1C")
plt.axis("off")
plt.show()

Graph Descriptive Stats

print("Nodes:", gg1.number_of_nodes())
print("Annotation relationships (edges):", gg1.number_of_edges())
print("Density:", nx.density(gg1))
print("Average Connections (degree):", sum(d for n, d in gg1.degree()) / gg1.number_of_nodes())
Nodes: 112
Annotation relationships (edges): 207
Density: 0.033301158301158304
Average Connections (degree): 3.6964285714285716
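The average degree printed above is just 2E/N, since each undirected edge contributes to two nodes’ degrees (2 × 207 / 112 ≈ 3.696). A toy check of the identity, separate from gg1:

```python
import networkx as nx

g = nx.path_graph(5)  # 5 nodes in a line: degrees 1, 2, 2, 2, 1
avg_by_sum = sum(d for _, d in g.degree()) / g.number_of_nodes()
avg_by_identity = 2 * g.number_of_edges() / g.number_of_nodes()
print(avg_by_sum, avg_by_identity)  # 1.6 1.6
```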

Comparing Centrality Across Categorical Groups

The categories assigned to the genes describe what each particular gene participates in. As we’ve seen, genes in this dataset predominantly participate in biological processes, and less often as a singular component or function.

centrality_summary.groupby("category_type")[["deg_centr", "eig_centr"]].describe()
deg_centr eig_centr
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
category_type
aspect 3.0 0.174763 1.588152e-01 0.047319 0.085804 0.124290 0.238486 0.352681 3.0 0.172731 0.220775 0.023699 0.045913 0.068127 0.247247 0.426366
evidence_code 12.0 0.085174 1.293122e-01 0.003785 0.006940 0.033123 0.091009 0.447319 12.0 0.064117 0.154816 0.000250 0.001863 0.009945 0.041232 0.549412
gene 740.0 0.000673 1.669654e-04 0.000631 0.000631 0.000631 0.000631 0.001893 740.0 0.009329 0.008321 0.000008 0.001606 0.002614 0.018095 0.021985
go_term 831.0 0.001262 4.339420e-19 0.001262 0.001262 0.001262 0.001262 0.001262 831.0 0.018935 0.010209 0.000789 0.014892 0.016657 0.032138 0.032138

Aspect has the highest number of connections, which makes sense: all gene annotations are labeled with an aspect, so they’re all connected to an aspect node. The same idea goes for evidence codes. Gene is the least connected, with a mean degree centrality of 0.000673, which also makes sense because a gene’s functional annotations are specific.

fig, axes= plt.subplots(1, 2, figsize=(12, 5))
palette= ["#9F99E8", "#5DCAA5", "#F0997B", "#F4C0D1"]
#---Degree Cent---
sns.boxplot(data=centrality_summary,
    x="category_type",
    y="deg_centr",
    palette=palette,
    ax=axes[0])
axes[0].set(title="Degree Centrality \nMany neighbors?", xlabel="", ylabel="")
axes[0].title.set_fontsize(14)

#--- Eigenvector Centr---
sns.boxplot(
    data=centrality_summary,
    x="category_type",
    y="eig_centr",
    palette=palette,
    ax=axes[1])
axes[1].set_yscale("log")
axes[1].set(title="Eigenvector Centrality (logged) \nMany important neighbors?", xlabel="", ylabel="")
axes[1].title.set_fontsize(14)

plt.tight_layout()
plt.show()

The boxplot is also a great visual for the connections in this data. Aspect and evidence code are the hubs; we see they are really the only visible boxes in the degree centrality panel because they are what the other nodes connect to. One evidence code is an outlier floating to the top of the plot, far more connected than any other evidence code. Our final graph depicted this: IEA was a large node in the center.

The eigenvector centrality plot works in the same manner: the boxes floating to the top are most important. Aspect is most important and evenly spaced. I logged the axis to see the differences between the groups more clearly. Aspect has many connections, and many of those connections are themselves important.

What we’ve seen answers the question: do category types rely on particular evidence types?

Yes, we see that the categories (aspects) rely on IEA (Inferred from Electronic Annotation). However, the research question was more specific.

“Do certain GO aspects (category types) tend to rely on particular evidence types? If so, which genes appear within those pathways?”

We’ll need to look at pairwise relations to answer this. The Kruskal-Wallis test compares ranks across the categories, and because our data is not normally distributed, this nonparametric test is the best fit.

Statistical Test on Categorical Data

Kruskal Wallis Test

“Are there differences between groups?”

#--grouping deg centr--
groups_dc= [group["deg_centr"].values
for name, group in centrality_summary.groupby("category_type")]

#-- grouping eig-- 
groups_ec= [group["eig_centr"].values
for name, group in centrality_summary.groupby("category_type")]

#-- Kruskal-- 
k_dc= kruskal(*groups_dc)
k_ec= kruskal(*groups_ec)

#-- results--
print("Kruskal Wallis Results")
print(f"(deg_centr): H= {k_dc.statistic:.3f}, p-val= {k_dc.pvalue:.3e}")
print(f"(eig_centr): H= {k_ec.statistic:.3f}, p-val= {k_ec.pvalue:.3e}")
Kruskal Wallis Results
(deg_centr): H= 1400.208, p-val= 2.654e-303
(eig_centr): H= 254.699, p-val= 6.302e-55

The resulting p-values of the Kruskal-Wallis test are incredibly low, so the result is considered significant: there is absolutely a difference between categories. Since the result was significant, I’ll run a Dunn post-hoc test to calculate pairwise ranks across groups; it’s considered the appropriate post-hoc for Kruskal-Wallis.3

Post-Hoc: Dunn

We’ve established that the IEA evidence code is an extreme outlier, which skews the means. We’ll pull the medians from the centrality dataframe to compare with the Dunn results. The Dunn post-hoc calculates pairwise ranks, which adds context to the Kruskal-Wallis result.

#--Grabbing Medians--
print(centrality_summary.groupby("category_type")[["deg_centr", "eig_centr"]].median().round(3))
               deg_centr  eig_centr
category_type                      
aspect             0.124      0.068
evidence_code      0.033      0.010
gene               0.001      0.003
go_term            0.001      0.017
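A tiny numeric illustration (made-up values, not from the dataset) of why the median is the safer summary when one value, like IEA’s centrality, dwarfs the rest:

```python
import numpy as np

scores = np.array([0.01, 0.02, 0.03, 0.04, 0.55])  # one IEA-like outlier
print(round(float(np.mean(scores)), 2))    # 0.13 -- dragged up by the outlier
print(round(float(np.median(scores)), 2))  # 0.03 -- reflects the typical value
```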
# --- Dunn's post-hoc ---
dunn_deg= sp.posthoc_dunn(centrality_summary, val_col="deg_centr", group_col="category_type", p_adjust="fdr_bh")

dunn_eig= sp.posthoc_dunn(centrality_summary, val_col="eig_centr", group_col="category_type", p_adjust="fdr_bh")

print("Degree centrality:\n", dunn_deg)
print("\nEigenvector centrality:\n", dunn_eig)
Degree centrality:
                      aspect  evidence_code           gene        go_term
aspect         1.000000e+00   9.883047e-01   4.837137e-07   5.925995e-02
evidence_code  9.883047e-01   1.000000e+00   4.219411e-24   1.586584e-04
gene           4.837137e-07   4.219411e-24   1.000000e+00  9.588917e-293
go_term        5.925995e-02   1.586584e-04  9.588917e-293   1.000000e+00

Eigenvector centrality:
                  aspect  evidence_code          gene       go_term
aspect         1.000000       0.042090  1.980449e-03  6.328206e-02
evidence_code  0.042090       1.000000  1.180762e-01  2.798842e-01
gene           0.001980       0.118076  1.000000e+00  5.549853e-55
go_term        0.063282       0.279884  5.549853e-55  1.000000e+00

For degree centrality, Dunn’s output identifies gene and aspect (category_type), gene and evidence_code, gene and go_term, and evidence_code and go_term as pairs with significant differences.

For eigenvector centrality, the significant pairs are gene and go_term, gene and aspect, and aspect and evidence_code (p ≈ 0.042); gene and evidence_code do not differ significantly here.

So the research question is answered. By degree centrality there is no statistically significant difference between aspect and evidence_code, though by eigenvector centrality there is a modest one. Gene nodes appear in nearly all of the pair comparisons flagged by Dunn’s output as having significant differences.

Spearman Rank Relations

cred_score_map = dict(zip(genes["gene_name"], genes["cred_score"]))
centrality_summary["cred_score"] = centrality_summary["node"].map(cred_score_map)
# drop missing credibility scores first
s_r= centrality_summary.dropna(subset=["cred_score"])

corr_deg, p_deg= spearmanr(s_r["cred_score"], s_r["deg_centr"])
corr_eig, p_eig= spearmanr(s_r["cred_score"], s_r["eig_centr"])

print("Spearman (deg):", corr_deg, "p-val(deg):", p_deg)
print("Spearman (eig):", corr_eig, "p-val(eig):", p_eig)
Spearman (deg): 0.08197957164982358 p-val(deg): 0.025742785443654166
Spearman (eig): -0.39839888156367137 p-val(eig): 1.4693999109911575e-29
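One subtlety in the cred_score_map above: gene names repeat across rows, and dict(zip(...)) keeps the last score seen per name. A toy illustration (hypothetical scores):

```python
names = ["CDK1", "CDK1", "PDCD6"]  # repeated gene name
scores = [1.0, 0.3, 0.5]           # the later 0.3 overwrites the earlier 1.0
print(dict(zip(names, scores)))    # {'CDK1': 0.3, 'PDCD6': 0.5}
```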

The Spearman rank test adds one last layer to this analysis. Its results show a weak but positive relationship between credibility score and degree centrality, and a stronger negative relationship between credibility score and eigenvector centrality. So, however slight, higher degree centrality is associated with higher credibility scores, and, in a stronger relationship, lower eigenvector centrality is associated with higher credibility.

In all, our exploration of the gene ontology data tells us that the network is structured around aspects and evidence codes, that influence and importance do not equal credibility, and that gene nodes are very much a group of their own, differing significantly from nearly every other node type.


References

Matplotlib List of Named colors:
https://matplotlib.org/stable/gallery/color/named_colors.html

Plotly Diverging Bar Chart:

https://www.geeksforgeeks.org/python/diverging-bar-chart-using-python/

Customizing Plotly:

https://plotly.com/python/templates/

Fixing spacing issue with Graphviz_layout()

https://networkx.org/documentation/stable/_modules/networkx/drawing/nx_agraph.html#graphviz_layout

NetworkX Crash Course: Graph Theory in Python:

https://youtu.be/VetBkjcm9Go?si=KtmFHoyRwoprbuub

Introduction to Network Analysis in Python

https://www.datacamp.com/courses/introduction-to-network-analysis-in-python?utm_medium=organic_social&utm_campaign=sharewidget&utm_content=coursedetailpage