########################################################################
#
# PHYLOCLUS Program Suite
#
########################################################################

####################
# QUICK START
####################

PHYLOCLUS is a collection of programs that perform phylogenetic
footprinting and motif clustering. It has two components: the motif
finding component and the motif clustering component.

To start using the program, use the following command to unpack the
package:

    tar -xvzf phyloclus.tgz

This unpacks all the program components into a directory named
"phyloclus".

The following is a quick guide to using the program. For more detailed
information, please read the 'DETAILED DESCRIPTION' section below.

1. It is assumed that you already have your sequence dataset file ready.
   (You have to find the upstream sequences of orthologous genes in the
   genomes of interest and organize them into the appropriate format;
   see "Motif finding component" below.)

2. Create directories for all the external programs needed by PHYLOCLUS
   and specify those directories (as absolute paths) in the related
   entries of the parameters file "phyloclus_params.txt". The entries in
   the parameters file and their corresponding programs are:

       biopros          BioProspector
       bioopt           BioOptimizer
       genomebg_prog    genomebg.linux

   Note:
   (1). You have to obtain these programs from their original authors.
        The genomebg.linux program can be obtained from Dr. X.S. Liu
        (http://genome.dfci.harvard.edu/xsliu/). BioProspector and
        BioOptimizer can be obtained from the sources below:

        BioProspector: http://ai.stanford.edu/~xsliu/BioProspector/
        BioOptimizer:  http://stat.wharton.upenn.edu/~stjensen/BO/BioOptimizer.web.cgi

   (2). If a program is installed so that you can run it by typing its
        name from any directory, leave an empty string '' as its path
        in the "phyloclus_params.txt" file.

3.
   Create directories for all the internal programs of PHYLOCLUS and
   specify them in the "phyloclus_params.txt" file. The entries in the
   parameters file (first column) and their corresponding programs
   (second column) are:

       generate_index    generate_index.pl
       get_entry         get_entry.pl
       motif_search      bpbo_motif_search.pl
       hclus             all the clustering programs, including:
                             hclus.cross.oneblock.perl
                             hclus.cross.twoblock.perl
                             hclus.cross.varw.perl
                             posteriormode.oneblock.perl
                             posteriormode.twoblock.perl
                             individualprobabilities.varw.perl

   Since all these programs are usually unpacked into the same
   'phyloclus' directory, you can use that directory as the path for
   all of them.

4. Create directories for storing temporary result and script files, and
   specify them in the "phyloclus_params.txt" file:

       biopros_tmp_res    for temporarily storing BioProspector program results
       bioopt_tmp_res     for temporarily storing BioOptimizer program results
       genomebg_files     for temporarily storing genomebg.linux program results
       script_files       for temporarily storing script files to be used by qsub
       result_files       for temporarily storing result files of bpbo_motif_search.pl

5. Modify other parameters in the phyloclus_params.txt file as you like.

6. Run phyloclus.pl as follows:

       phyloclus.pl -seqfile [your sequence file] -params_file phyloclus_params.txt &

   If you want to use a genome specific background, run the following
   command instead:

       phyloclus.pl -seqfile [your sequence file] -params_file phyloclus_params.txt -genome_specific_bg -bg_seqfile [your background sequence file] &

   If you have a Linux cluster and can use qsub to submit jobs, add the
   -use_qsub option to the command. The program will then generate a
   qsub script (stored in the directory specified by the script_files
   entry in the parameters file) for each orthologous sequence dataset
   and submit those scripts through qsub.
   It will also generate a script for the clustering component, which
   can be run after all the qsub jobs have finished.

   If you want to use variable width clustering, add the
   -varw_clustering option to the command. (Note: currently only
   one-block variable width clustering is available.)

############################
# DETAILED DESCRIPTION
############################

########################################################################
# User input parameters
########################################################################

The program takes its parameter values from the parameter file specified
by the -params_file option. Users can change these parameters by editing
this file. The parameter file for PHYLOCLUS is phyloclus_params.txt.
Each entry line in the file has the following format:

    key1[tab]key2[tab]parameters

Anything following a '#' is treated as a comment and ignored. Users can
change the parameters, but cannot change key1 and key2, since they are
predefined by the program. However, users can add certain entries. For
example, they can add:

    biopros -a 1

which is a BioProspector option specifying that every sequence contains
the motif.

########################
Motif finding component
########################

The component consists of a single program, bpbo_motif_search.pl, which
finds motifs by iteratively running two programs: BioProspector and
BioOptimizer. Briefly, at each iteration, BioProspector is first used to
search the sequence dataset for motifs of different fixed widths
(specified by the user). After redundant motifs are removed, the
remaining motifs are all optimized by BioOptimizer, and the highest
scoring motif is kept and masked out for the following iteration. This
iterative procedure runs until no significant motifs are found (i.e. no
motif is significant after BioOptimizer optimization).

Input sequence format:

Input sequences must follow an 'embedded FASTA format'.
Basically, a group of sequences is represented by the sequences in
restricted FASTA format (the same as regular FASTA format, except that
each sequence must be on one line), preceded by an ID line that starts
with '>>' immediately followed by an ID containing no white space. This
is like a group of small FASTA files 'embedded' in a larger FASTA
format. For each sequence in restricted FASTA format, the ID line must
contain only an ID preceded by '>'; otherwise it causes an error when
the BioProspector output is parsed.

You can have one or multiple groups of sequences in one file. You can
even keep all groups of sequences in one file and search only one group
at a time by using the '-id' option. This allows multiple searches to be
performed at the same time if you have a clustered computing facility.

An example illustrating the input sequence format is shown below:

>>16077069
>ba:30260196
GGAAAATGTCAATGGTTCTAGAAATTTTCATCAAATATTCATTTTGAACCTATTCACATA
>bh:15612564
TTTGTAAATGATCTCCTTAGTCTATTTTACAACAACATCACTGTGGATAACATCTATCCA
>bs:16077069
TCAACTTTCGAAACCTTATTTTTTAGATTCCTTAATTTTACGGAAAAAAGACAAATTCAA
>ca:15893299
TACTTTATAATTATATATTTCAATATAGCCAATGTCAAGAAAAACGATTCTGCCAACCTT
>cp:18308983
CTTAACTTCTATTATGTTAAAAATTCAAATTTCAAAAGATATTTTCCCTAATATATACCT
>li:16799080
CAACCTCATCTTTTGTCTTTTTTTTGGTATTCTGCTAAATAGTTTAACACATTGTCGCAA
>oi:23097456
CATTGATTTTCTGCTTTTTATTTGGTATTATATTTATGTTTTAACCTGTGGAAACAGAAA

Genome specific background:

The program has an option '-genome_specific_bg' that allows Markov
background frequencies to be calculated from a specific combination of
genomes. For example, if a promoter sequence dataset contains sequences
from genomes A, B and C, it is advantageous during the BioProspector
motif search to use all promoter region sequences from genomes A, B and
C to calculate the Markov background frequencies used by BioProspector.
To cover all cases, we need to compute the Markov background frequencies
for all combinations of genomes.
That is 2^n-1 combinations in total for n genomes. The program has an
option '-comp_genomebg' to do this computation. Issue the following
command:

    bpbo_motif_search.pl -comp_genomebg -seqfile [your sequence file] -params_file [parameter file]

The program will create a directory, at the path set in the parameter
file, to store the background files for all the combinations. These
files are used if you specify the '-genome_specific_bg' option in the
motif search.

###########################
Motif clustering component
###########################

One-Block Fixed Width Clustering Programs
-----------------------------------------

1. hclus.cross.oneblock.perl - program that clusters single-block matrices

   command line:
   > perl hclus.cross.oneblock.perl inputfile width numiters outputname

   This program produces four files:
   "outputname.oneblock.clus.width.iters.input" - input matrices
       (use as matrixfile in second step)
   "outputname.oneblock.clus.width.iters.align" - alignment within each
       matrix by iteration (use as alignfile in second step)
   "outputname.oneblock.clus.width.iters.clusind" - clustering indicators
       for each matrix by iteration (use as clusindfile in second step)
   "outputname.oneblock.clus.width.iters.clusprop" - pairwise clustering
       proportions for each matrix

2. posteriormode.oneblock.perl: program that calculates best partition
   from cluster results

   command line:
   > perl posteriormode.oneblock.perl matrixfile alignfile clusindfile width numiters outputname

   This program produces two files:
   "outputname.bestclusters" - best partition of clusters
   "outputname.clusstats" - clustering statistics (number of clusters, etc.)

Two-Block Fixed Width Clustering Programs
-----------------------------------------

1.
   hclus.cross.twoblock.perl - program that clusters two-block matrices

   command line:
   > perl hclus.cross.twoblock.perl inputfile width numiters outputname

   This program produces four files:
   "outputname.twoblock.clus.width.iters.input" - input matrices
       (use as matrixfile in second step)
   "outputname.twoblock.clus.width.iters.align" - alignment within each
       matrix by iteration (use as alignfile in second step)
   "outputname.twoblock.clus.width.iters.clusind" - clustering indicators
       for each matrix by iteration (use as clusindfile in second step)
   "outputname.twoblock.clus.width.iters.clusprop" - pairwise clustering
       proportions for each matrix

2. posteriormode.twoblock.perl: program that calculates best partition
   from cluster results

   command line:
   > perl posteriormode.twoblock.perl matrixfile alignfile clusindfile width numiters outputname

   This program produces two files:
   "outputname.bestclusters" - best partition of clusters
   "outputname.clusstats" - clustering statistics (number of clusters, etc.)

One-Block Variable Width Clustering Programs
--------------------------------------------

1.
   hclus.cross.varw.perl - program that clusters single-block matrices
   and allows motif width to vary

   command line:
   > perl hclus.cross.varw.perl inputfile minwidth priorwidth numiters outputname

   where "minwidth" is the minimum width for the matrices and
   "priorwidth" is the expected width for the matrices

   This program produces five files:
   "outputname.clus.varw.minwidth.priorwidth.iters.input" - input
       matrices (use as matrixfile in second step)
   "outputname.clus.varw.minwidth.priorwidth.iters.align" - alignment
       within each matrix by iteration (use as alignfile in second step)
   "outputname.clus.varw.minwidth.priorwidth.iters.clusind" - clustering
       indicators for each matrix by iteration (use as clusindfile in
       second step)
   "outputname.clus.varw.minwidth.priorwidth.iters.clusprop" - pairwise
       clustering proportions for each matrix
   "outputname.clus.varw.minwidth.priorwidth.iters.width" - motif widths
       for each matrix by iteration (use as widthfile in second step)

2. posteriormode.varw.perl: program that calculates best partition from
   cluster results

   command line:
   > perl posteriormode.varw.perl matrixfile alignfile clusindfile widthfile priorwidth numiters outputname

   This program produces two files:
   "outputname.bestclusters" - best partition of clusters
   "outputname.clusstats" - clustering statistics (number of clusters, etc.)

3. individualprobabilities.varw.perl: program that calculates individual
   clustering probabilities for best partition

   command line:
   > perl individualprobabilities.varw.perl matrixfile bestclustersfile priorwidth

   This program produces one file:
   "bestclustersfile.new" - new best partition of clusters with
       individual clustering probabilities included.
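
The three variable width steps chain together through their output
files. The following Python sketch is not part of PHYLOCLUS; it only
builds the three command lines and does not run them. It assumes that
the placeholders minwidth, priorwidth and iters in the output file names
are replaced by the literal argument values, which this README implies
but does not state outright. The function name varw_commands and the
example arguments are hypothetical.

```python
def varw_commands(inputfile, minwidth, priorwidth, numiters, outputname):
    """Build the argument lists for the three variable width steps."""
    # ASSUMPTION: the file name placeholders are filled with the literal
    # argument values passed to hclus.cross.varw.perl.
    prefix = f"{outputname}.clus.varw.{minwidth}.{priorwidth}.{numiters}"
    matrixfile = prefix + ".input"
    alignfile = prefix + ".align"
    clusindfile = prefix + ".clusind"
    widthfile = prefix + ".width"
    bestclustersfile = outputname + ".bestclusters"

    # Step 1: cluster the matrices, allowing the motif width to vary.
    step1 = ["perl", "hclus.cross.varw.perl", inputfile,
             str(minwidth), str(priorwidth), str(numiters), outputname]
    # Step 2: compute the best partition from the step 1 outputs.
    step2 = ["perl", "posteriormode.varw.perl", matrixfile, alignfile,
             clusindfile, widthfile, str(priorwidth), str(numiters),
             outputname]
    # Step 3: add individual clustering probabilities to the best partition.
    step3 = ["perl", "individualprobabilities.varw.perl", matrixfile,
             bestclustersfile, str(priorwidth)]
    return [step1, step2, step3]


# Hypothetical arguments, for illustration only.
for cmd in varw_commands("motifs.txt", 6, 10, 1000, "run1"):
    print(" ".join(cmd))
```

Check the printed file names against the files that step 1 actually
writes before relying on them.
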
Format of One-Block Input Matrix Files
--------------------------------------

>matrixname1
A 1 2 0 0 0 1 0 0 0 0
C 3 2 2 2 0 3 4 0 0 3
G 0 0 1 2 1 0 0 0 1 0
T 0 0 1 0 3 0 0 4 3 1
>matrixname2
A 0 1 0 3 3 4 4 0 4 0 0
C 2 0 0 0 0 0 0 1 0 0 0
G 0 0 0 0 1 0 0 0 0 0 0
T 2 3 4 1 0 0 0 3 0 4 4
>matrixname3
A 11 0 0 11 3 8 0 0 0 1
C 1 3 10 0 5 0 9 1 0 2
G 0 4 3 2 2 2 4 3 1 2
T 1 6 0 0 3 3 0 9 12 8

Format of Two-Block Input Matrix Files
--------------------------------------

>matrixname1block1
A 1 2 0 0 0 1 0 0 0 0
C 3 2 2 2 0 3 4 0 0 3
G 0 0 1 2 1 0 0 0 1 0
T 0 0 1 0 3 0 0 4 3 1
>matrixname1block2
A 0 1 0 3 3 4 4 0 4 0 0
C 2 0 0 0 0 0 0 1 0 0 0
G 0 0 0 0 1 0 0 0 0 0 0
T 2 3 4 1 0 0 0 3 0 4 4
>matrixname2block1
A 11 0 0 11 3 8 0 0 0 1
C 1 3 10 0 5 0 9 1 0 2
G 0 4 3 2 2 2 4 3 1 2
T 1 6 0 0 3 3 0 9 12 8
>matrixname2block2
A 11 11 10 0 1 7 0 1 8
C 0 1 3 7 2 1 1 1 2
G 1 0 0 6 1 5 3 8 3
T 1 1 0 0 9 0 9 3 0
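
The matrix file layout can be checked before clustering with a few lines
of code. The following Python sketch is not part of PHYLOCLUS. It
assumes that each matrix is a '>' name line followed by four count rows
labelled A, C, G and T, one row per line with whitespace-separated
integer counts. The names parse_matrices and matrix_width are
hypothetical.

```python
BASES = ("A", "C", "G", "T")


def parse_matrices(text):
    """Return {matrix name: {base: [counts]}} for a matrix file's text."""
    matrices = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between matrices
        if line.startswith(">"):
            current = line[1:].strip()
            matrices[current] = {}
            continue
        if current is None:
            raise ValueError("count row found before any '>' name line")
        base, *counts = line.split()
        if base not in BASES:
            raise ValueError(f"unexpected row label {base!r} in {current}")
        matrices[current][base] = [int(c) for c in counts]

    # Every matrix needs all four rows, each with the same number of columns.
    for name, rows in matrices.items():
        if set(rows) != set(BASES):
            raise ValueError(f"{name}: expected rows A, C, G and T")
        if len({len(v) for v in rows.values()}) != 1:
            raise ValueError(f"{name}: rows have different widths")
    return matrices


def matrix_width(rows):
    """Number of motif positions (columns) in one parsed matrix."""
    return len(rows["A"])


# First matrix from the one-block example in this README.
example = """\
>matrixname1
A 1 2 0 0 0 1 0 0 0 0
C 3 2 2 2 0 3 4 0 0 3
G 0 0 1 2 1 0 0 0 1 0
T 0 0 1 0 3 0 0 4 3 1
"""
parsed = parse_matrices(example)
print(matrix_width(parsed["matrixname1"]))
```

Two-block files use the same row layout, so the same function reads
them. Treating consecutive matrices as block1 and block2 of one motif is
an assumption based on the example names, and the sketch does not
enforce it.
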