--modified on 11/12/08 Add refsrc=src option to maf2fasta. The default is to use the source of the first component of the first alignment block in the maf file. -- See README2 for more updated programs. 10/12/05 -- big revisions 1/13/05 --modified on 11/23/04 add a switch to maf_order to keep single-row alignment or not. --modified on 11/22/04 added program descriptions. --modified on 11/02/04 maf_order is added. README version 10 -- modified on 11/01/04 add a small tool "get_standard_headers", which return in format of "1:end:+:srcSize" assuming the sequence contig starts at position 1. The start position in fasta header is inclusive while the end position is exclusive, so the srcSize is one larger than end. README version 10 -- modified on 10/26/04 -- maf2lav and maf2fasta allow the species with more than one contigs. Parameters are the same as before. README version 10 -- modified on 10/25/04 -- Multiz maf_Project maf_project_simple programs are removed. -- all_bz multiz programs no longer need mapping file. README version 10 -- modofied on 10/09/04 -- Var_multiz program becomes Multiz version 10. -- var_multiz program becomes multiz version 10. README version3 -- modified on 01/13/05 _________________________________________________________ --------------------------------------------------------- | Examples commands to run all_bz, tba, multiz | --------------------------------------------------------- all_bz "(((((human chimp) galago)(mouse rat)) chicken)(tetra (fugu zebrafish)))" blastz_spec_file tba "(((((human chimp) galago)(mouse rat)) chicken)(tetra (fugu zebrafish)))" *.maf destination-file multiz human.chimp.galago.maf human.mouse.rat.maf 1 maf_project tba.maf zebrafish _________________________________________________________ --------------------------------------------------------- | Some requirements | --------------------------------------------------------- *1. format of sequence header: >string1:string2:int1:char:int3 Note: string1 is usually species name, string2 is usually chromosome information. "string1.string2" is used as src field in maf struct. int1 is start position(1 based, inclusive). int3 is src size. char is +/-. It's allowed not to contain standard header as long as there is only one contig in the sequence file. *2. MSA header can also be accepted. ">${COMMON_NAME}|${ENCODE_REGION}|${FREEZE_DATE}|${NCBI_TAXON_ID} |${ASSEMBLY_PROVIDER}|${ASSEMBLY_DATE}|${ASSEMBLY_ID}|${CHROMOSOME} |${CHROMOSOME_START}|${CHROMOSOME_END}|${CHROM_LENGTH}|${STRAND} |${ACCESSION}.${VERSION}|${NUM_BASES}|${NUM_N}|${THIS_CONTIG_NUM} |${TOTAL_NUM_CONTIGS}|${OTHER_COMMENTS} empty field shall be represented by ".". *3. mafFile component positions start at 0 instead of 1. *4. species-guid-tree is used as arugment in many of the following programs, it consists of double quotes, parenthesis, and species names, e.g. "((HUMAN CHIMP)(RAT MOUSE))" _________________________________________________________ --------------------------------------------------------- | program descriptions | --------------------------------------------------------- ---------------------< all_bz >-------------------------- all_bz [-+] species-guid-tree [blastz_spec] No mapping file is needed. all_bz is a wrapper for blastzWrapper, it generates blastzWrapper/blastz commands for pairs of specified sequences. Optional blastz-spec file contains command-lind options for the blastz runs. A '+' tells all_bz to echo each blastz command to stdout before executing it. A '-' tells all_bz to simply echo the commands ---------------------< blastzWrapper >-------------------------- blastzWrapper seq-file1 seq-file2 [options] [options] are the same ones as defined in "blastz" program. blastzWrapper processes two sequence files each containing one or more contigs and runs blastz. Each sequence contig must follow the sequence header format described above. ---------------------< lav2maf >------------------------ lav2maf blastz-file seq-file1 seq-file2 The program transform a blastz output format file blastz-file into maf format. seq-file1 and seq-file2 are sequence files used to run blastz. ---------------------< maf2lav >------------------------ maf2lav align.maf seq1 seq2 The program transform a maf format file align.maf into blastz output format. seq1 and seq2 are source sequences. -------------------------------< maf2fasta >--------------------------------- maf2fasta refseq-file maf-file [beg end] [fasta[2]][?] [iupac2n] [refsrc=src] The program transform a maf format file maf-file into fasta format with reference to refseq-file. The maf-file shall not have inversion or overlap regions. The maf-file shall be referenced(top row) by species of refseq-file. [beg end] option limits the resulted fasta region to be within beg and end in respect to the reference. [fasta[2]][?]: The default output format is that used for text alignments by MultiPipMaker; appending the command-line argument "fasta" or "fasta2" requests FastA-format output; with "fasta2", alignment rows are split into rows of length <= 50 (set by COL_WIDTH). To identify gaps as either "within -alignment" or "between-alignment", append a character to the word "fasta" or "fasta2" that will replace '-' between two local alignments, as in "fasta@" [iupac2n]: If a nucleotide character other than one of "ACGTNacgtn" is found in the refseq-file it is mapped to either "N" or "n" depending on its case. [refsrc=src]: Specify that the component source for the reference species is src. By default, fasta2maf will assume that the source of the first component of the first alignment block in the maf-file is the source of the reference species. --------------------< maf_order >---------------------- maf_order maf-file species1 species2 ... [nohead] [all] maf-file is the maf file to be ordered. It is followed by species names need to be included in the ordered file, the ordering of components in a maf block follows the ordering of the species names in the argments. Species not in the arguments are excluded. [nohead]: if nohead is specified, maf header is not shown in the result projection file. [all]: if all is specified, single-row blocks are also included, otherwise excluded from resulted file. ---------------------< maf_project >-------------------- maf_project maf-file reference [from to] [filename-for-other-mafs] [species-guid-tree] [nohead] maf_project is able to process maf file where there might be more than one contigs for any species. Arguments: reference: the sequence to which the maf file is projected. The result maf blocks always have reference sequence in the top row, and maf blocks are ordered by the starting position of the top row. [from to]: the output maf blocks are limited to [from to] area in respect to positions on reference sequence. [filename-for-other-mafs]: collect maf blocks not contained in projected file. WHEN THIS ARGUMENT IS NOT SPECIFIED, THE PROJECTED MAF BLOCKS ARE BEAUTIFIED. [species-guid-tree]: species not specified in this argument are screened out. [nohead]: if nohead is specified, maf header is not shown in the result projection file. ---------------------< mafFind >------------------------------- mafFind file.maf beg end [species-prefix] [slice] mafFind finds mafs intersecting a particular interval. The mafs whose first row intersects positions beg-end are printed. For non-reference species, a command like mafFind file.maf beg end mm3 asks for mafs that have a row where"mm3" is a prefix of the "source"(e.g., "mm3.chr7") and which intersects positions beg-end. Finally, the last argument "slice" asks that ends of the reported mafs be trimmed to make them precisely match beg-end, as in: mafFind file.maf beg end mm3 slice ---------------------< multiz version 10 >--------------------- multiz [R=?] [M=?] maf-file1 maf-file2 v [out1] [out2] [nohead] multiz version 10 allows each species containing more than one contigs without reqirement for mapping-file. maf-file1 and maf-file2 are two maf files to be aligned, each topped by a same reference sequence. The alignment of reference sequence with other components might be just for purpose of determing approximate alignment between two files, thus the alignment might be fixed or not, this is specified by v value, which can be only 0 or 1. 0 - neither alignment of reference in each file is fixed. 1 - the alignment of reference in the first file is fixed. [R=?] species radius values in dynamic programming, by default 30 [M=?] species minimum output width, by default 1, which means output all blocks. [out1] collects unused blocks from maf-file1 [out2] collects unused blocks from maf-file2 [nohead] specifies not to have maf header for output ---------------------< pair2tb >------------------------ pair2tb pairwise.maf seq-file1 seq-file2 The program assumes the input file pairwise.maf does not have overlapping blocks. It also assumes the top components correspond to seq-file1, and second components correspond to seq-file2. seq-file1 and seq-file2 allow for more than one contig. ---------------------< single_cov2 >-------------------- single_cov2 pairwise.maf [F=deleted.maf] This program removes overlapped regions from pairwise.maf. mapping file is not needed. Optional [F=deleted.maf] specifies filename collecting removed regions. ---------------------< tba >---------------------------- tba [+-] [R=?] [M=?] species-guid-tree maf_source destination-file tba program passes R M values to multiz program if any of them are provided. This version requires destination filename to be provided. arg '-' means no execute, '+' means verbose. ---------------------< maf_checkThread >----------------- maf_checkThread projected-maf-file this tool check the threading condition of the reference. It assumes the sequence starts at position 0. So there is one error when the sequence starts at a position other than 0. The alignment file must be projected onto reference. And the threading condition is checked for reference only. It has to be done for all species in TBA alignments to make sure the whole alignment satisifies threading condition.