LCrossman/punchline

Punchline Pfam domain searches

Definition:

punchline /ˈpʌn(t)ʃlʌɪn/ noun

the final phrase or sentence of a joke or story, providing the humour or some other crucial element.

Dependencies:
HMMER3 - installation instructions here: http://hmmer.org/documentation.html
Python 2.7 or Python 3
Biopython - http://biopython.org/DIST/docs/install/Installation.html
R
R corrplot library - run install.packages("corrplot") from within R


First step:
Gather your files:
1. Download the Pfam-A database from https://pfam.xfam.org (use the FTP tab at the top of the page). Make a note of the version number in case you need to quote it later.
2. Collect all the proteins you want to search into a single protein FASTA file. Each protein needs a unique name that also shows which strain it came from, such as H10407|H10407_0003: the strain name comes first, followed by a | character and then the name of the coding sequence or locus tag. ** more details on how to do this if requested **
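The naming scheme above can be applied with a few lines of Python. This is only a minimal sketch, not a script from this repository: the function name prefix_fasta_headers and the example records are made up for illustration, and it assumes the locus tag is the first word of each existing FASTA header.

```python
def prefix_fasta_headers(lines, strain):
    """Return FASTA lines with each header rewritten as >strain|locus_tag.

    Assumption: the locus tag is the first whitespace-separated word
    of the original header; anything after it is dropped.
    """
    renamed = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            words = line[1:].split()
            locus_tag = words[0] if words else ""
            renamed.append(">{}|{}".format(strain, locus_tag))
        else:
            # Sequence lines are passed through unchanged.
            renamed.append(line)
    return renamed


# Hypothetical input records, for illustration only.
example = [">H10407_0003 hypothetical protein", "MKT", ">H10407_0004", "MSL"]
renamed = prefix_fasta_headers(example, "H10407")
print("\n".join(renamed))
```

Run this once per strain, then concatenate the renamed files into the single FASTA file used for the search.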
Second step:
Run Pfam - the fastest way to search a large number of protein sequences against a large number of Pfam motifs is to use hmmsearch:
hmmsearch --noali --notextw --cpu [insert number of cpus here] --domtblout your_outfile_name Pfam_A_hmm_profiles your_protein_fasta
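To show how the --domtblout table relates to the strain|locus_tag names, here is a rough sketch of reading it in Python. It is not pfam_presence_absence.py and makes no claim about how that script works. It relies on the domtblout layout written by hmmsearch, where the first column is the target (protein) name and the fourth column is the query (Pfam profile) name. The function name and the example row are invented for illustration.

```python
def strain_domain_pairs(domtblout_lines):
    """Collect the set of (strain, Pfam domain) pairs found in a domtblout table."""
    pairs = set()
    for line in domtblout_lines:
        # Skip the comment header/footer and any blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        protein_name = fields[0]   # column 1: target (protein) name
        domain_name = fields[3]    # column 4: query (Pfam profile) name
        # The strain is whatever precedes the first | in the protein name.
        strain = protein_name.split("|", 1)[0]
        pairs.add((strain, domain_name))
    return pairs


# One made-up row in domtblout layout, for illustration only.
example = [
    "# target name  accession  tlen  query name  accession  qlen  E-value ...",
    "H10407|H10407_0003 - 250 ABC_tran PF00005.27 137 1.2e-30 105.3 0.0 "
    "1 1 2.0e-34 1.2e-30 105.3 0.0 1 137 20 170 20 170 0.95 hypothetical protein",
]
pairs = strain_domain_pairs(example)
print(sorted(pairs))
```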

Third step:
Run pfam_presence_absence.py
pfam_presence_absence.py hmmsearch_outfile Pfam_domain_names

This will create a large number of files, each with the species in question on the top line. The Pfam_domain_names file is a text file listing all the Pfam domain short names you are searching with, one name per line. An example file is provided in the examples folder; it is specific to version 32.0 of Pfam-A, released Sept. 2018.
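If you are using a different Pfam-A release from the example file, one way to build a matching list of short names is to read the NAME lines from the HMM profile file, since each profile in a HMMER3 file has a line starting with NAME. The sketch below is not part of this repository; the function name and the sample lines are made up for illustration.

```python
def pfam_short_names(hmm_lines):
    """Return the profile short names from the lines of a HMMER3 profile file."""
    names = []
    for line in hmm_lines:
        fields = line.split()
        # Each profile record carries one line of the form: NAME  <short name>
        if len(fields) >= 2 and fields[0] == "NAME":
            names.append(fields[1])
    return names


# A few made-up lines in HMMER3 profile layout, for illustration only.
example = [
    "HMMER3/f [3.1b2 | February 2015]",
    "NAME  1-cysPrx_C",
    "ACC   PF10417.9",
    "//",
    "HMMER3/f [3.1b2 | February 2015]",
    "NAME  ABC_tran",
]
names = pfam_short_names(example)
print("\n".join(names))
```

Writing the returned names to a text file, one per line, gives a file in the same layout as Pfam_domain_names.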

Fourth step:
We join these files to an R matrix with:
join_and_create_dataset_punchline.R
Includes:
Kruskal-Wallis significance testing using R

** What to do if the plot needs resizing: coming soon **

Fifth step:
Heatmap and correlation plot using Rscript
Rscript Heatmapper_punchline.R


Sixth step:
Phylogeny clustering
*still under construction
You will be able to build the input phylogeny file by running plot_pfam_phylog.R. Then run PHYLIP neighbor on that file and afterwards run plot_pfam_phylog.R. If you get errors (usually about the strain names), you can try validateforphylip.py.
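Errors about strain names often come from the strict PHYLIP format, which expects names of at most 10 characters and does not accept characters such as parentheses, colons, commas or semicolons. The sketch below shows one possible way to make names safe. It is an assumption about the kind of fix needed, not a description of what validateforphylip.py does, and the function name and example names are hypothetical.

```python
import re


def phylip_safe_names(names, width=10):
    """Return names cleaned, cut to `width` characters and made unique."""
    safe = []
    seen = set()
    for name in names:
        # Replace anything other than letters, digits and underscores.
        cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)[:width]
        candidate = cleaned
        counter = 1
        # If truncation produced a duplicate, overwrite the tail with a counter.
        while candidate in seen:
            suffix = str(counter)
            candidate = cleaned[:width - len(suffix)] + suffix
            counter += 1
        seen.add(candidate)
        safe.append(candidate)
    return safe


# Hypothetical strain names, for illustration only.
example = ["Escherichia_coli_H10407", "Escherichia_coli_K12", "EC(958)"]
safe = phylip_safe_names(example)
print(safe)
```

Keep a record of which shortened name maps to which original strain so the tree labels can be restored afterwards.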


Look at the resulting tree with your favourite tree viewer program, such as FigTree.
Or try color_tree_labels.R in the useful_scripts folder on GitHub.

About

Punchline for pangenomes
