DATA620-Project 1

Author

Ana Collado

Categorical Groups and Centrality in Gene Ontology Data

For your first project, you are asked to:

  1. Identify and load a network dataset that has some categorical information available for each node.

  2. For each of the nodes in the dataset, calculate degree centrality and eigenvector centrality.

  3. Compare your centrality measures across your categorical groups.

Data Load

import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scikit_posthocs as sp
import seaborn as sns
from plotly import graph_objects as go
from networkx.drawing.nx_agraph import graphviz_layout
from scipy.stats import kruskal, spearmanr

In this first project, I will use SNAP data titled Gene Function Association Network. The dataset contains an association network of over 22k nodes and over 16k edges for gene function. The edges are meant to describe what a gene does or is involved in.

genes=pd.read_csv("GF-Miner_miner-gene-function.tsv", sep= "\t")
genes.head()
# GO_ID C0 C1 Gene C3 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16
0 GO:0005509 UniProtKB A0A024QZ42 PDCD6 NaN GO_REF:0000002 IEA InterPro:IPR002048 F HCG1985580, isoform CRA_c A0A024QZ42_HUMAN|PDCD6|hCG_1985580 protein taxon:9606 20160409 InterPro NaN NaN
1 GO:0004672 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000002 IEA InterPro:IPR000719|InterPro:IPR002290|InterPro... F Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein taxon:9606 20160409 InterPro NaN NaN
2 GO:0005524 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000038 IEA UniProtKB-KW:KW-0067 F Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein taxon:9606 20160410 UniProt NaN NaN
3 GO:0005634 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000052 IDA NaN C Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein taxon:9606 20101115 HPA NaN NaN
4 GO:0005737 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000052 IDA NaN C Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein taxon:9606 20101115 HPA NaN NaN

As interesting as the molecular world is, I’m unfamiliar with its many intricacies, particularly in the field of genomics. The variables in the dataset are labeled alphanumerically, a format that doesn’t explain or hint at the meaning of the observations, so as it stands the data isn’t accessible to those unfamiliar with the subject matter. Since no data dictionary is provided with the data, I’ll need to research the datapoints in order to understand the variables, and all of the columns will require renaming for clarity.

Before I begin researching, let’s check how many unique values and NA values each column contains:

#-- organizing overview--
overview= (genes.agg(["nunique", lambda x: x.isna().sum()]).T)
#-- renaming summary columns for clarity--
#-- and the index because I prefer it this way--
overview=overview.reset_index()
overview.columns=["variable", "unique", "NA count"]
overview
variable unique NA count
0 # GO_ID 16628 0
1 C0 3 0
2 C1 6676 0
3 Gene 5957 0
4 C3 5 16407
5 C5 4742 0
6 C6 13 0
7 C7 6160 6106
8 C8 3 0
9 C9 6006 0
10 C10 6660 36
11 C11 3 0
12 C12 17 0
13 C13 2213 0
14 C14 22 0
15 C15 525 15961
16 C16 76 16494

There are several columns with many NA values; C3, C7, C10, C15, and C16 are all flagged. In the interest of time, I will likely impute NA values, but I do want to acknowledge the context of the NA values first. There are several columns I’m sure I won’t need, so I will remove them now.

My main focus here is human data. Although column C12 has no missing values, every datapoint in it refers to humans, so I am sure it will not be needed. I also won’t need the PubMed identifiers in C13 or the Ontology Resource names in C14.

#-- removing columns not needed--
genes= genes.drop(columns=["C12", "C13", "C14"])

Creating a Data Dictionary

  • GO_ID : A unique identifier from the standardized Gene Ontology framework used in biology 1
    rename to go_id

  • C0: Database
    rename to database

  • C1: Gene ID- Genes receive a unique ID for database records. I’ll save this column for labeling. The ID mappings can be retrieved using https://www.uniprot.org/id-mapping
    rename to gene_id

  • Gene : Named Genes

    rename to gene_name

  • C3 : Action upon gene (Colocation, Contribution, ...)
    rename to action

  • C5: Reference
    rename to ref

  • C6: Evidence code- a note on how a gene’s function was established; 13 unique values:
    https://geneontology.org/docs/guide-go-evidence-codes/
    experimental evidence, phylogenetic evidence (inferred through experimental evidence), computational evidence, author statements (statements made by authors in scientific papers), curatorial statements (an expert from a database), automatically generated annotations (automated, no person reviewed this)
    rename to evidence_code
    maybe I can add alpha values to depict how credible the gene assignments are?

  • C7: external reference
    rename to ex_ref

  • C8: Aspect/Category_type- Process, Function, or Component; the type of role the gene plays
    rename to category_type

  • C9: Encoder_name
    rename to encode_name

  • C10 : Encoder
    rename to encoder

  • C11 : Function type
    rename to func_type

  • C12 : Taxonomy. All datapoints include “taxon:9606”, which is Homo sapiens

  • C13 : PMID- not needed

  • C14: Ontology Resources- not needed

  • C15: Action upon entity (entity can be searched)
    rename to action_on_entity

  • C16 : Database reference
    rename to db_ref

To create this dictionary I sampled random datapoints within each column and researched them until I found consensus between sources. For example, C8 is a category/type. I searched “Gene Ontology F, C, P” and found GO Enrichment Analysis tools. The site describes how the gene ontology tool is used and explains that a “GO Aspect” may be entered to learn a particular gene’s aspect: “molecular function”, “cellular component”, or “biological process” (F, C, P). I searched the database for the corresponding IDs of my randomly selected rows and confirmed that the “Aspect” listed in the database (F, C, or P) matched each row’s value in C8. I renamed the column category_type instead of aspect.


Example search, “aspect” is listed as “molecular function” and it is coded as “F” in column C8

With the data’s dictionary completed, I’ll modify the column names to be more descriptive:

genes.columns=["go_id", "database", "gene_id", "gene_name", "action", "ref", "evidence_code", "ex_ref", "category_type", "encode_name", "encoder", "func_type", "action_on_entity", "db_ref"]
genes.head()
go_id database gene_id gene_name action ref evidence_code ex_ref category_type encode_name encoder func_type action_on_entity db_ref
0 GO:0005509 UniProtKB A0A024QZ42 PDCD6 NaN GO_REF:0000002 IEA InterPro:IPR002048 F HCG1985580, isoform CRA_c A0A024QZ42_HUMAN|PDCD6|hCG_1985580 protein NaN NaN
1 GO:0004672 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000002 IEA InterPro:IPR000719|InterPro:IPR002290|InterPro... F Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein NaN NaN
2 GO:0005524 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000038 IEA UniProtKB-KW:KW-0067 F Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein NaN NaN
3 GO:0005634 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000052 IDA NaN C Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein NaN NaN
4 GO:0005737 UniProtKB A0A024QZP7 CDK1 NaN GO_REF:0000052 IDA NaN C Cyclin-dependent kinase 1 A0A024QZP7_HUMAN|CDK1 protein NaN NaN

The NA overview will now be more readable:

overview= (genes.agg(["nunique", lambda x: x.isna().sum()]).T)

overview=overview.reset_index()
overview.columns=["variable", "unique", "NA count"]
overview
variable unique NA count
0 go_id 16628 0
1 database 3 0
2 gene_id 6676 0
3 gene_name 5957 0
4 action 5 16407
5 ref 4742 0
6 evidence_code 13 0
7 ex_ref 6160 6106
8 category_type 3 0
9 encode_name 6006 0
10 encoder 6660 36
11 func_type 3 0
12 action_on_entity 525 15961
13 db_ref 76 16494

Creating a Graph, Version 1

The data does specify that the edges are the functional annotations and the nodes are the go_id and gene_name values. In this project, however, we’ll use category_type and database as our categorical groups, because each is a tidy variable with only 3 distinct values. This is still exploratory; as such, the graph will be undirected because we’re exploring mutual connection.
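In networkx terms, the undirected choice means an edge’s endpoint order carries no meaning. A minimal sketch with hypothetical node names (not from the dataset):

```python
import networkx as nx

g = nx.Graph()                      # undirected: (u, v) and (v, u) are the same edge
g.add_edge("GENE_X", "GO:0000001")  # hypothetical gene/GO pair
g.add_edge("GO:0000001", "GENE_X")  # re-adding in reverse order is a no-op
print(g.number_of_edges())          # 1 -- the connection is mutual by construction
```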

action had far too many NA values (which could very well mean there is no “action” for some genes). It’s best to check the distribution of these NA values while we’re on the topic.

#--grabbing the points needed--
na_actions= genes[genes["action"].isna()]
cat_total= genes.groupby(["category_type"]).size().reset_index(name="count")
na_counts= na_actions.groupby(["category_type"]).size().reset_index(name="na_count")

#--neat df--
summary_denom= pd.merge(na_counts, cat_total, on="category_type")
summary_denom["na_percent"]= (summary_denom["na_count"]/summary_denom["count"]*100).round(2)

summary_denom
category_type na_count count na_percent
0 C 1463 1508 97.02
1 F 3841 3944 97.39
2 P 11103 11176 99.35

There doesn’t seem to be much difference between the categories in terms of the NA values within the action variable. Over 90% of values are NA, but judging by the sheer volume and attention to detail of genomic data, it’s very likely there simply is no immediate action for the genes marked NA. The emptiness doesn’t mean the value is missing; rather, it reflects the quality of a gene not being the direct instigator or catalyst of an action.

It might be worth keeping the action column and using a conditional to display an action when one is available. However, that action feature would be directional, and the rest of my data is better served by an undirected graph depicting relationships. For the sake of simplicity and time, I will drop the action column.

genes= genes.drop(columns=["action", "ex_ref", "action_on_entity", "db_ref", "encoder"])
gg1= nx.Graph()

I ran a preliminary graph, pictured below, and learned there is far too much data to depict. A stratified sample of the data will maintain the original balance of the categorical values in genes while remaining readable.

Stage 1: Screenshot of Full Sample Graph too Dense to Read

Stratified Sampling

#--stratified sample--
sample= genes.groupby("category_type", group_keys=False).sample(frac= 0.05)

gg1.add_edges_from(sample[["gene_name", "evidence_code"]].values)
gg1.add_edges_from(sample[["evidence_code", "go_id"]].values)
gg1.add_edges_from(sample[["go_id", "category_type"]].values)
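As a quick sanity check that groupby().sample() preserves category balance, here’s a sketch on a toy frame (the imbalance pattern mirrors the aspect counts above, but the data is made up):

```python
import pandas as pd

# toy frame: mostly "P", fewer "F" and "C", like the real aspect distribution
toy = pd.DataFrame({"category_type": ["P"] * 100 + ["F"] * 40 + ["C"] * 20})

# stratified 50% sample: each category keeps its original share
strat = toy.groupby("category_type", group_keys=False).sample(frac=0.5, random_state=1)

before = toy["category_type"].value_counts(normalize=True).round(2)
after = strat["category_type"].value_counts(normalize=True).round(2)
print(before.equals(after))  # True -- proportions survive the sampling
```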

After a stratified sample, the data is ready for visualization. The vantage point I’m interested in after examining the data involves evidence_codes, the notes on the function of a gene. There are 26 unique codes in existence, divided into 6 main code categories:

  1. experimental evidence
  2. phylogenetic evidence (inferred through experimental evidence)
  3. computational evidence
  4. author statements (statements made by authors in scientific papers)
  5. curatorial statements (an expert from a database)
  6. automatically generated annotations (automated, no person reviewed this)

Note: The gene data used here only contains 13 unique codes after NA removal.

What’s important about this?

The evidence codes assigned to the genes are not “equal”: one code may carry more weight than another in a sort of credibility hierarchy. “Non-traceable author statements” naturally have less credibility than annotations backed by an actual experiment. Lowering the opacity on gene annotations whose origin is less credible would add an uncluttered extra layer of information to the network graph.

As a draft, the visual below shows how I would rate the evidence codes on an alpha scale. Those under experimental evidence (which I would say is most credible) will be graded as 1 (EXP, IDA, IPI, IMP, IGI, and IEP). The lowest values will be 0.3 for IEA and 0.2 for ND (and NAS, upon closer inspection).
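A minimal sketch of how those grades could become per-node opacity at draw time (the grades follow the draft above; the tiny graph and node names are hypothetical):

```python
import matplotlib.pyplot as plt
import networkx as nx

# draft grades: experimental codes opaque, automated/unreviewed codes faded
cred_alpha = {"EXP": 1.0, "IDA": 1.0, "TAS": 0.5, "IEA": 0.3, "NAS": 0.2, "ND": 0.2}

g = nx.Graph()
g.add_edges_from([("GENE_A", "EXP"), ("GENE_A", "IEA"), ("GENE_B", "NAS")])

# one alpha per node: non-evidence nodes default to fully opaque
alphas = [cred_alpha.get(n, 1.0) for n in g.nodes()]
pos = nx.spring_layout(g, seed=7)
nx.draw_networkx_nodes(g, pos, alpha=alphas)  # alpha accepts a per-node sequence
nx.draw_networkx_edges(g, pos)
plt.axis("off")
```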

Research Question

With that said, the research question for the rest of this project will be:

“Do certain GO aspects (category types) tend to rely on particular evidence types? If so, which genes appear within those pathways?”

Mapping Categories

I’ll map the definitions for the evidence code acronyms for readability:

#-- The list-- 
ec_names= {"IEA": "Inferred from Electronic Annotation",
"IDA": "Inferred from Direct Assay",
"ISS": "Inferred from Sequence or Structural Similarity",
"IMP": "Inferred from Mutant Phenotype",
"IPI": "Inferred from Physical Interaction",
"TAS": "Traceable Author Statement",
"ND": "No Biological Data Available",
"IBA": "Inferred from Biological Aspect of Ancestor",
"IEP": "Inferred from Expression Pattern",
"EXP": "Inferred from Experiment",
"NAS": "Non-traceable Author Statement",
"IGI": "Inferred from Genetic Interaction",
"IC": "Inferred by Curator"}

cta_types= {"P": "Process","F": "Function", "C": "Component"}
#-- mapping evidence names--
genes["evidence_names"]= genes["evidence_code"].map(ec_names)

#-- Mapping Aspect/ Category Types--
genes["aspect"]= genes["category_type"].map(cta_types)

Below is a table of evidence codes and how they’re distributed within each category:

#--Starter Scores--
credibility_score= {"EXP": 1, "IDA": 1, "IPI": 1, "IMP": 1, "IGI": 1, "IEP": 1, "ISS": 1, "IBA": 0.5, "IC": 0.5, "TAS": 0.5,"IEA": 0.3, "NAS": 0.2, "ND": 0.2}

genes["cred_score"]= genes["evidence_code"].map(credibility_score)

cred_count= genes.groupby(["aspect", "evidence_names", "cred_score"]).size().reset_index(name="count").sort_values(["aspect", "cred_score"], ascending= [True,False])
#--Pivoting for evidence code counts--
piv_cred_count= cred_count.pivot(index="evidence_names", columns="aspect", values="count").fillna(0)

piv_cred_count= piv_cred_count[["Process", "Function", "Component"]]
piv_cred_count= piv_cred_count.sort_values(by="Process", ascending=False)

piv_cred_count
aspect Process Function Component
evidence_names
Inferred from Electronic Annotation 5072.0 1747.0 683.0
Inferred from Direct Assay 1877.0 864.0 419.0
Inferred from Sequence or Structural Similarity 1502.0 190.0 116.0
Inferred from Mutant Phenotype 1287.0 107.0 15.0
Traceable Author Statement 622.0 411.0 84.0
Inferred from Biological Aspect of Ancestor 437.0 267.0 131.0
Non-traceable Author Statement 162.0 76.0 30.0
Inferred by Curator 79.0 15.0 12.0
Inferred from Genetic Interaction 70.0 8.0 1.0
Inferred from Expression Pattern 62.0 0.0 0.0
Inferred from Physical Interaction 3.0 174.0 16.0
Inferred from Experiment 2.0 84.0 0.0
No Biological Data Available 1.0 1.0 1.0

Evidence Code Composition

I added grades to each evidence code reflecting the credibility hierarchy discussed earlier; they will double as alpha values once the graph is rendered. The next item to address is the evidence code composition within each aspect. I need to normalize the groups to be able to compare them to each other. It’s also important to take into account the volume (count) and value (cred_score) of each scored evidence code in each category.

I’ll create a dataframe that will support visualization of my assigned credibility scores. First, I’ll calculate the rates of the evidence codes by group. My initial instinct was to calculate raw proportions, but because the groups are all different sizes I need to normalize in order to compare evidence codes within a category.

#-- evidence code rates, by category--
cred_count["ec_rates"]= (cred_count["count"] / cred_count.groupby("aspect")["count"].transform("sum")).round(3)

Next, I’ll negate the rates for datapoints whose credibility score is less than 0.5:

cred_count.loc[cred_count["cred_score"]<0.5, "ec_rates"]= -cred_count["ec_rates"]

I’ll adjust the dataframe:

cred_plot= pd.pivot_table(cred_count, index= "aspect", columns= "evidence_names", values= "ec_rates")

Composition on Plotly

Now we’ll put the plot together. Evidence codes with a low credibility score appear on the left, while codes with a high credibility score appear on the right.

div_plot= go.Figure()

for col in cred_plot.columns:
  div_plot.add_trace(go.Bar(x= cred_plot[col], y=cred_plot.index, orientation= "h", name= col, hovertemplate= "%{y}: %{x:.3f}"))

div_plot.update_layout(template= "plotly_white", barmode= "relative", height=600, width=900, yaxis_autorange="reversed", bargap=0.5, legend_orientation="v", title= "Evidence Code Rates and Credibility by Annotation Category Type (Aspect)", title_x= 0.5, title_font_size=18)

#-- adding annotations for right and left-- 
div_plot.add_annotation(x= 0.25, y= 1.1, text= "credible → ", xref="x", yref= "paper",showarrow=False, font=dict(size= 12, color= "black"))

div_plot.add_annotation(x= -0.25, y= 1.1, text= "← less credible", xref="x", yref= "paper", showarrow=False, font=dict(size= 12, color= "black"))

div_plot.show()

Centrality of Nodes in a Graph

Degree centrality and eigenvector centrality will help identify which nodes are the most connected and which of those connections matter.

centrality= nx.degree_centrality(gg1)
eig_centrality= nx.eigenvector_centrality(gg1, max_iter=2000)

#-- creating new columns for centrality-- 
gene_centrality= (sample[["gene_name"]].drop_duplicates())
gene_centrality["degree_centrality"]= gene_centrality["gene_name"].map(centrality)
gene_centrality["eig_centrality"]= gene_centrality["gene_name"].map(eig_centrality)


top_genes= gene_centrality.sort_values("eig_centrality", ascending=False).head(30)

#-- isolating top genes--
top_genes_2= gene_centrality.sort_values("eig_centrality", ascending=False).head(8)
top_sample= sample[sample["gene_name"].isin(top_genes["gene_name"])]
#--creating summary of centrality by nodes--
centrality_summary= pd.DataFrame({
  "node": list(centrality.keys()),
  "deg_centr": list(centrality.values()),
  "eig_centr": list(eig_centrality.values())})
  
#-- attaching the categories--
cats= genes[["go_id", "category_type"]].drop_duplicates()
cats.columns= ["node", "category_type"]

#--putting them together--
centrality_summary= centrality_summary.merge(cats, on="node", how= "left")

#--labeling the nodes correctly--
centrality_summary["category_type"]= centrality_summary["category_type"].fillna("gene")

ec_names_set= set(ec_names.keys())

centrality_summary["category_type"]= (centrality_summary["node"].map(
  lambda x: 
        "evidence_code" if x in ec_names_set
        else "aspect"    if x in {"C", "F", "P"}
        else "go_term"   if x in set(genes["go_id"].dropna())
        else "gene"))

Creating a Graph, Version 2

#--resetting graph--
gg1= nx.Graph()

#-- edges--
gg1.add_edges_from(top_sample[["gene_name", "evidence_code"]].values)
gg1.add_edges_from(top_sample[["evidence_code", "go_id"]].values)
gg1.add_edges_from(top_sample[["go_id", "category_type"]].values)

#--node groups--
gene_nodes= set(top_sample["gene_name"])
evidence_nodes= set(top_sample["evidence_code"])
go_nodes= set(top_sample["go_id"])
category_nodes= set(top_sample["category_type"])

I’ll assign a scaling function over the nodes in my graph. In testing I found that, despite sampling, there are still too many nodes and edges in the graph to make it readable. Adding visual priority to the nodes with the highest eigenvector values, those most important and most connected to other important nodes, adds to the readability of the graph.

#-- trying to emulate plotly_white's colors--
category_color_map= {"P": "#636EFA", "F": "#00CC96", "C": "#EF553B"}
gene_color= "#AB63FA"
go_color= "#FFA15A"
evidence_code_color= "#B6B6B6"

#-- defining the scale--
base_sizes= {
    "category": 800,
    "gene": 300,
    "evidence_code": 200,
    "go_term": 20}
scale= 2500

#-- creating lists--
node_colors= []
node_sizes= []

#-- a loop to assign my specs--
for node in gg1.nodes():
    if node in category_nodes:
      node_colors.append(category_color_map[node])
      base= base_sizes["category"]
    elif node in gene_nodes:
        node_colors.append(gene_color)
        base= base_sizes["gene"]
    elif node in evidence_nodes:
        node_colors.append(evidence_code_color)
        base= base_sizes["evidence_code"]
    elif node in go_nodes:
        node_colors.append(go_color)
        base= base_sizes["go_term"]
    else:
        node_colors.append("#E5E5E5")
        base= 40
    eig= eig_centrality.get(node, 0)
    node_sizes.append(base + eig * scale)


#--spacing fix for graph (earlier attempt, kept for reference)--
#pos_init= {
#    "P": np.array([0, 0]),
#    "F": np.array([30, 20]),
#    "C": np.array([25, -25])}

#pos= nx.spring_layout(gg1, pos=pos_init, fixed=list(pos_init.keys()), seed=33, k=8, iterations=2000)


Next I encountered issues with overlapping nodes. Cutting down the data again was not something I could accept. Upon researching the issue, it was possible to choose anchor points on the graph in the form of arrays. I selected the aspect nodes as anchors but the graph was still not appearing in a satisfactory way.

The closest thing to a solution came from “GraphViz AGraph”, a module for NetworkX.2 The module can be adjusted a little more easily.

pos= graphviz_layout(gg1, prog="sfdp", args="-Goverlap=false -Gsep=+5")
#--not labeling go nodes--
labels= {n: n for n in gg1.nodes() if n in gene_nodes| category_nodes| evidence_nodes}

#-- drawing the graph--
plt.figure(figsize=(20,20), dpi=300)
#--changing background--
plt.gca().set_facecolor("white")
plt.gcf().set_facecolor("white")

nx.draw_networkx_edges(gg1, pos, alpha=0.15, width=0.4, edge_color="#1C1C1C")
nx.draw_networkx_nodes(gg1, pos, node_color=node_colors, node_size=node_sizes)
nx.draw_networkx_labels(gg1, pos, labels=labels, font_size=8, font_color="#2a2a2a")

plt.title("Gene Ontology Annotations:\nNetworks Between GO IDs, Evidence Codes, and Associated Genes", fontsize=12, fontweight="normal", color="#1C1C1C")
plt.axis("off")
plt.show()

Graph Descriptive Stats

print("Nodes:", gg1.number_of_nodes())
print("Annotation relationships (edges):", gg1.number_of_edges())
print("Density:", nx.density(gg1))
print("Average Connections (degree):", sum(d for n, d in gg1.degree()) / gg1.number_of_nodes())
Nodes: 112
Annotation relationships (edges): 207
Density: 0.033301158301158304
Average Connections (degree): 3.6964285714285716
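The average degree printed above is just 2E/N, since each undirected edge contributes to two nodes’ degrees (2 × 207 / 112 ≈ 3.696). A toy check of the identity, separate from gg1:

```python
import networkx as nx

g = nx.path_graph(5)  # 5 nodes in a line: degrees 1, 2, 2, 2, 1
avg_by_sum = sum(d for _, d in g.degree()) / g.number_of_nodes()
avg_by_identity = 2 * g.number_of_edges() / g.number_of_nodes()
print(avg_by_sum, avg_by_identity)  # 1.6 1.6
```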

Comparing Centrality Across Categorical Groups

The categories assigned to the genes describe what each particular gene participates in. As we’ve seen, genes in this dataset predominantly participate in biological processes, and less often as a singular component or function.

centrality_summary.groupby("category_type")[["deg_centr", "eig_centr"]].describe()
deg_centr eig_centr
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
category_type
aspect 3.0 0.174763 1.588152e-01 0.047319 0.085804 0.124290 0.238486 0.352681 3.0 0.172731 0.220775 0.023699 0.045913 0.068127 0.247247 0.426366
evidence_code 12.0 0.085174 1.293122e-01 0.003785 0.006940 0.033123 0.091009 0.447319 12.0 0.064117 0.154816 0.000250 0.001863 0.009945 0.041232 0.549412
gene 740.0 0.000673 1.669654e-04 0.000631 0.000631 0.000631 0.000631 0.001893 740.0 0.009329 0.008321 0.000008 0.001606 0.002614 0.018095 0.021985
go_term 831.0 0.001262 4.339420e-19 0.001262 0.001262 0.001262 0.001262 0.001262 831.0 0.018935 0.010209 0.000789 0.014892 0.016657 0.032138 0.032138

Aspect has the highest number of connections, which makes sense: all gene annotations are labeled with an aspect, so they’re all connected to an aspect node. The same idea goes for evidence codes. Gene is the least connected, with a mean degree centrality of 0.000673, which also makes sense because a gene’s functional annotations are specific.

fig, axes= plt.subplots(1, 2, figsize=(12, 5))
palette= ["#9F99E8", "#5DCAA5", "#F0997B", "#F4C0D1"]
#---Degree Cent---
sns.boxplot(data=centrality_summary,
    x="category_type",
    y="deg_centr",
    palette=palette,
    ax=axes[0])
axes[0].set(title="Degree Centrality \nMany neighbors?", xlabel="", ylabel="")
axes[0].title.set_fontsize(14)

#--- Eigenvector Centr---
sns.boxplot(
    data=centrality_summary,
    x="category_type",
    y="eig_centr",
    palette=palette,
    ax=axes[1])
axes[1].set_yscale("log")
axes[1].set(title="Eigenvector Centrality (logged) \nMany important neighbors?", xlabel="", ylabel="")
axes[1].title.set_fontsize(14)

plt.tight_layout()
plt.show()

The boxplot is also a great visual for the connections in this data. Aspect and evidence code are the hubs; we see they are really the only visible boxes in the degree centrality panel because they are what the other nodes connect to. One evidence code is an outlier floating to the top of the plot, far more connected than any other evidence code. Our final graph depicted this: IEA was a large node in the center.

The eigenvector centrality plot works in the same manner: the boxes floating to the top are most important. Aspect is most important and evenly spaced. I logged the axis to see the differences between the groups more clearly. Aspect has many connections, and many of those connections are themselves important.

What we’ve seen answers the question: do category types rely on particular evidence types?

Yes, we see that the categories (aspects) rely on IEA (Inferred from Electronic Annotation). However, the research question was more specific.

“Do certain GO aspects (category types) tend to rely on particular evidence types? If so, which genes appear within those pathways?”

We’ll need to look at pairwise relations to answer this. The Kruskal-Wallis test compares ranks across the categories, and because our data is not normally distributed, this nonparametric test is the best fit.

Statistical Test on Categorical Data

Kruskal Wallis Test

“Are there differences between groups?”

#--grouping deg centr--
groups_dc= [group["deg_centr"].values
for name, group in centrality_summary.groupby("category_type")]

#-- grouping eig-- 
groups_ec= [group["eig_centr"].values
for name, group in centrality_summary.groupby("category_type")]

#-- Kruskal-- 
k_dc= kruskal(*groups_dc)
k_ec= kruskal(*groups_ec)

#-- results--
print("Kruskal Wallis Results")
print(f"(deg_centr): H= {k_dc.statistic:.3f}, p-val= {k_dc.pvalue:.3e}")
print(f"(eig_centr): H= {k_ec.statistic:.3f}, p-val= {k_ec.pvalue:.3e}")
Kruskal Wallis Results
(deg_centr): H= 1400.208, p-val= 2.654e-303
(eig_centr): H= 254.699, p-val= 6.302e-55

The resulting p-values of the Kruskal-Wallis test are incredibly low, so the result is considered significant: there is absolutely a difference between categories. Since the result was significant, I’ll run a Dunn post-hoc test to calculate pairwise ranks across groups; it’s considered the appropriate post-hoc for Kruskal-Wallis.3

Post-Hoc: Dunn

We’ve established that the IEA evidence code is an extreme outlier, which skews the means. We’ll pull the medians from the centrality dataframe to compare with the Dunn results. The Dunn post-hoc calculates pairwise ranks, which adds context to the Kruskal-Wallis result.

#--Grabbing Medians--
print(centrality_summary.groupby("category_type")[["deg_centr", "eig_centr"]].median().round(3))
               deg_centr  eig_centr
category_type                      
aspect             0.124      0.068
evidence_code      0.033      0.010
gene               0.001      0.003
go_term            0.001      0.017
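A tiny numeric illustration (made-up values, not from the dataset) of why the median is the safer summary when one value, like IEA’s centrality, dwarfs the rest:

```python
import numpy as np

scores = np.array([0.01, 0.02, 0.03, 0.04, 0.55])  # one IEA-like outlier
print(round(float(np.mean(scores)), 2))    # 0.13 -- dragged up by the outlier
print(round(float(np.median(scores)), 2))  # 0.03 -- reflects the typical value
```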
# --- Dunn's post-hoc ---
dunn_deg= sp.posthoc_dunn(centrality_summary, val_col="deg_centr", group_col="category_type", p_adjust="fdr_bh")

dunn_eig= sp.posthoc_dunn(centrality_summary, val_col="eig_centr", group_col="category_type", p_adjust="fdr_bh")

print("Degree centrality:\n", dunn_deg)
print("\nEigenvector centrality:\n", dunn_eig)
Degree centrality:
                      aspect  evidence_code           gene        go_term
aspect         1.000000e+00   9.883047e-01   4.837137e-07   5.925995e-02
evidence_code  9.883047e-01   1.000000e+00   4.219411e-24   1.586584e-04
gene           4.837137e-07   4.219411e-24   1.000000e+00  9.588917e-293
go_term        5.925995e-02   1.586584e-04  9.588917e-293   1.000000e+00

Eigenvector centrality:
                  aspect  evidence_code          gene       go_term
aspect         1.000000       0.042090  1.980449e-03  6.328206e-02
evidence_code  0.042090       1.000000  1.180762e-01  2.798842e-01
gene           0.001980       0.118076  1.000000e+00  5.549853e-55
go_term        0.063282       0.279884  5.549853e-55  1.000000e+00

For degree centrality, Dunn’s output identifies gene and aspect (category_type), gene and evidence_code, gene and go_term, and evidence_code and go_term as pairs with significant differences.

For eigenvector centrality, the significant pairs are gene and go_term, gene and aspect, and aspect and evidence_code (p ≈ 0.042); gene and evidence_code do not differ significantly here.

So the research question is answered. By degree centrality there is no statistically significant difference between aspect and evidence_code, though by eigenvector centrality there is a modest one. Gene nodes appear in nearly all of the pair comparisons flagged by Dunn’s output as having significant differences.

Spearman Rank Relations

cred_score_map = dict(zip(genes["gene_name"], genes["cred_score"]))
centrality_summary["cred_score"] = centrality_summary["node"].map(cred_score_map)
# drop missing credibility scores first
s_r= centrality_summary.dropna(subset=["cred_score"])

corr_deg, p_deg= spearmanr(s_r["cred_score"], s_r["deg_centr"])
corr_eig, p_eig= spearmanr(s_r["cred_score"], s_r["eig_centr"])

print("Spearman (deg):", corr_deg, "p-val(deg):", p_deg)
print("Spearman (eig):", corr_eig, "p-val(eig):", p_eig)
Spearman (deg): 0.08197957164982358 p-val(deg): 0.025742785443654166
Spearman (eig): -0.39839888156367137 p-val(eig): 1.4693999109911575e-29
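One subtlety in the cred_score_map above: gene names repeat across rows, and dict(zip(...)) keeps the last score seen per name. A toy illustration (hypothetical scores):

```python
names = ["CDK1", "CDK1", "PDCD6"]  # repeated gene name
scores = [1.0, 0.3, 0.5]           # the later 0.3 overwrites the earlier 1.0
print(dict(zip(names, scores)))    # {'CDK1': 0.3, 'PDCD6': 0.5}
```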

The Spearman rank test adds one last layer to this analysis. Its results show a weak but positive relationship between credibility score and degree centrality, and a stronger negative relationship between credibility score and eigenvector centrality. So, however slight, higher degree centrality is associated with higher credibility scores, and, in a stronger relationship, lower eigenvector centrality is associated with higher credibility.

In all, our exploration of the gene ontology data tells us that the network is structured around aspects and evidence codes, that influence and importance do not equal credibility, and that gene nodes are very much a group of their own, differing significantly from nearly every other node type.


References

Matplotlib List of Named colors:
https://matplotlib.org/stable/gallery/color/named_colors.html

Plotly Diverging Bar Chart:

https://www.geeksforgeeks.org/python/diverging-bar-chart-using-python/

Customizing Plotly:

https://plotly.com/python/templates/

Fixing spacing issue with Graphviz_layout()

https://networkx.org/documentation/stable/_modules/networkx/drawing/nx_agraph.html#graphviz_layout

NetworkX Crash Course: Graph Theory in Python:

https://youtu.be/VetBkjcm9Go?si=KtmFHoyRwoprbuub

Introduction to Network Analysis in Python

https://www.datacamp.com/courses/introduction-to-network-analysis-in-python?utm_medium=organic_social&utm_campaign=sharewidget&utm_content=coursedetailpage