.\" Automatically generated by Pod::Man 4.11 (Pod::Simple 3.35)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "BP_GENBANK_REF_EXTRACTOR 1p"
.TH BP_GENBANK_REF_EXTRACTOR 1p "2020-03-13" "perl v5.30.0" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
bp_genbank_ref_extractor \- Retrieves all related sequences for a list of searches on Entrez Gene
.SH "VERSION"
.IX Header "VERSION"
version 1.77
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
\&\fBbp_genbank_ref_extractor\fR [options] [Entrez Gene Queries]
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
This script searches the \fIEntrez Gene\fR database and retrieves not only the
gene sequence but also the related transcript and protein sequences.
.PP
The gene UIDs of multiple searches are collected before attempting to retrieve
them, so each gene is analyzed only once even if it appears in the results of
more than one search.
.PP
Note that \fIby default no sequences are saved\fR (see options and examples).
.SH "OPTIONS"
.IX Header "OPTIONS"
Several options can be used to fine-tune the script's behaviour. It is possible
to obtain extra base pairs upstream and downstream of the gene, and to control
the naming of files and the genome assembly to use.
.PP
See the NON-BUGS section for problems that may arise when relying on the
default values of options.
.IP "\fB\-\-assembly\fR" 4
.IX Item "--assembly"
When retrieving the sequence, a specific assembly can be defined. The expected
value is a regex, which is matched case-insensitively. If it matches more than
one assembly, the first match is used. It defaults to
\&\f(CW\*(C`(primary|reference) assembly\*(C'\fR.
.IP "\fB\-\-debug\fR" 4
.IX Item "--debug"
If set, even more output will be printed that may help with debugging. Unlike
the messages from \fB\-\-verbose\fR and \fB\-\-very\-verbose\fR, these will not
appear in the log file unless this option is selected.
This option also sets \fB\-\-very\-verbose\fR.
.IP "\fB\-\-downstream\fR, \fB\-\-down\fR" 4
.IX Item "--downstream, --down"
Specifies the number of extra base pairs to be retrieved downstream of the
gene. These extra base pairs affect only the gene sequence, not the
transcripts or proteins.
.IP "\fB\-\-email\fR" 4
.IX Item "--email"
A valid email used to connect to the \s-1NCBI\s0 servers. This may be used by
\&\s-1NCBI\s0 to contact users in case of problems and before blocking access in
case of heavy usage.
.IP "\fB\-\-api\-key\fR" 4
.IX Item "--api-key"
As of December 2018, \s-1NCBI\s0 requires an \s-1API\s0 key for higher request rates
(more than 3 requests per second; up to 10 per second with a key). You may
generate one in the \*(L"My \s-1NCBI\*(R"\s0 area.
.IP "\fB\-\-format\fR" 4
.IX Item "--format"
Specifies the format in which the sequences will be saved. Defaults to
\&\fIgenbank\fR format. Valid formats are 'genbank' and 'fasta'.
.IP "\fB\-\-genes\fR" 4
.IX Item "--genes"
Specifies the base name for the gene files. By default, genes are not saved.
If no value is given, it defaults to the gene \s-1UID.\s0 Possible values are
\&'uid', 'name' and 'symbol' (the official symbol or nomenclature).
.IP "\fB\-\-help\fR" 4
.IX Item "--help"
Display the documentation (this text).
.IP "\fB\-\-limit\fR" 4
.IX Item "--limit"
When making a query, limit the results to the first hits, up to the given
number. This guards against overly unspecific queries; a warning is given if
a query returns more results than the limit. The default value is 200. Note
that this limit applies to \fIeach\fR search.
.IP "\fB\-\-non\-coding\fR, \fB\-\-nonon\-coding\fR" 4
.IX Item "--non-coding, --nonon-coding"
Some protein coding genes have transcripts that are non-coding. By default,
these sequences are saved as well. \fB\-\-nonon\-coding\fR can be used to ignore
those transcripts.
.IP "\fB\-\-proteins\fR" 4
.IX Item "--proteins"
Specifies the base name for the protein files. By default, proteins are not
saved. If no value is given, it defaults to the protein accession.
Possible values are 'accession', 'description', 'gene' (the corresponding gene
\&\s-1ID\s0) and 'transcript' (the corresponding transcript accession).
.Sp
Note that with any value other than 'accession', files may be overwritten:
the same gene can encode more than one protein, and different proteins can
have the same description.
.IP "\fB\-\-pseudo\fR, \fB\-\-nopseudo\fR" 4
.IX Item "--pseudo, --nopseudo"
By default, sequences of pseudo genes will be saved. \fB\-\-nopseudo\fR can be
used to ignore those genes.
.IP "\fB\-\-save\fR" 4
.IX Item "--save"
Specifies the path for the directory where the sequence and log files will be
saved. If the directory does not exist, it will be created, although the path
to it must exist. Files in the directory may be overwritten if necessary. If
unspecified, a directory named \fIextracted sequences\fR in the current
directory will be used.
.IP "\fB\-\-save\-data\fR" 4
.IX Item "--save-data"
This option saves the data (gene UIDs, descriptions, product accessions, etc.)
to a file. As an optional value, the file format can be specified. Defaults to
\&\s-1CSV.\s0
.Sp
Currently only \s-1CSV\s0 is supported.
.Sp
Saving the data structure as a \s-1CSV\s0 file requires the installation of the
Text::CSV module.
.IP "\fB\-\-transcripts\fR, \fB\-\-mrna\fR" 4
.IX Item "--transcripts, --mrna"
Specifies the base name for the transcript files. By default, transcripts are
not saved. If no value is given, it defaults to the transcript accession.
Possible values are 'accession', 'description', 'gene' (the corresponding gene
\&\s-1ID\s0) and 'protein' (the protein the transcript encodes).
.Sp
Note that with any value other than 'accession', files may be overwritten:
the same gene can have more than one transcript, and different transcripts
can have the same description. Also, non-coding transcripts will create
problems if using 'protein'.
.IP "\fB\-\-upstream\fR, \fB\-\-up\fR" 4
.IX Item "--upstream, --up"
Specifies the number of extra base pairs to be extracted upstream of the gene.
These extra base pairs affect only the gene sequence, not the transcripts or
proteins.
.IP "\fB\-\-verbose\fR, \fB\-\-v\fR" 4
.IX Item "--verbose, --v"
If set, the program becomes verbose. For an extremely verbose program, use
\&\fB\-\-very\-verbose\fR instead.
.IP "\fB\-\-very\-verbose\fR, \fB\-\-vv\fR" 4
.IX Item "--very-verbose, --vv"
If set, the program becomes extremely verbose. Setting this option
automatically sets \fB\-\-verbose\fR as well. For help in debugging, consider
using \fB\-\-debug\fR.
.SH "EXAMPLES"
.IX Header "EXAMPLES"
.Vb 3
\&  bp_genbank_ref_extractor \e
\&        \-\-transcripts=accession \e
\&        \*(Aq"homo sapiens"[organism] AND H2B\*(Aq
.Ve
.PP
Search Entrez Gene with the query
\&\f(CW\*(Aq"homo sapiens"[organism] AND H2B\*(Aq\fR and save only the
transcript sequences of the hits. Note that, with the default value of
\&\fB\-\-limit\fR, only some of the hits may be extracted.
.PP
.Vb 5
\&  bp_genbank_ref_extractor \e
\&        \-\-transcripts=accession \-\-proteins=accession \e
\&        \-\-format=fasta \e
\&        \*(Aq"homo sapiens"[organism] AND H2B\*(Aq \e
\&        \*(Aq"homo sapiens"[organism] AND MCPH1\*(Aq
.Ve
.PP
Save both transcript and protein sequences in the fasta format, for two
queries, \f(CW\*(Aq"homo sapiens"[organism] AND H2B\*(Aq\fR and
\&\f(CW\*(Aq"homo sapiens"[organism] AND MCPH1\*(Aq\fR.
.PP
.Vb 3
\&  bp_genbank_ref_extractor \e
\&        \-\-genes \-\-down=500 \-\-up=100 \e
\&        \*(Aq"homo sapiens"[organism] AND H2B\*(Aq
.Ve
.PP
Download genomic sequences, including 500 bp downstream and 100 bp upstream
of each gene.
.PP
.Vb 3
\&  bp_genbank_ref_extractor \e
\&        \-\-genes \-\-assembly=\*(AqAlternate HuRef\*(Aq \e
\&        \*(Aq"homo sapiens"[organism] AND H2B\*(Aq
.Ve
.PP
Download genomic sequences from the Alternate HuRef genome assembly.
.PP
.Vb 2
\&  bp_genbank_ref_extractor \-\-save\-data=CSV \e
\&        \*(Aq"homo sapiens"[organism] AND H2B\*(Aq
.Ve
.PP
Do not save any sequence, only save the results in a \s-1CSV\s0 file.
.PP
.Vb 6
\&  bp_genbank_ref_extractor \-\-save=\*(Aqsearch\-results\*(Aq \e
\&        \-\-genes=name \-\-downstream=500 \-\-upstream=200 \e
\&        \-\-nopseudo \-\-nonon\-coding \-\-transcripts \-\-proteins \e
\&        \-\-format=fasta \-\-save\-data=CSV \e
\&        \*(Aq"homo sapiens"[organism] AND H2B\*(Aq \e
\&        \*(Aq"homo sapiens"[organism] AND MCPH1\*(Aq
.Ve
.PP
Ignoring non-coding transcripts and pseudo genes, download the genomic
sequences with 500 bp downstream and 200 bp upstream, using the gene name as
filename; the transcript and protein sequences, using their accession numbers
as filenames; everything in fasta format, plus a \s-1CSV\s0 file with the search
results; all saved in a directory named \fIsearch-results\fR.
.SH "NON-BUGS"
.IX Header "NON-BUGS"
.IP "\(bu" 4
When supplying options, it is possible to omit a value and use the default.
However, when the expected value is a string, the next argument may be
mistaken for the option's value. For example, when using the following
command:
.Sp
.Vb 2
\&  bp_genbank_ref_extractor \-\-transcripts \e
\&        \*(AqH2A AND homo sapiens\*(Aq
.Ve
.Sp
we mean to search for 'H2A \s-1AND\s0 homo sapiens', saving only the transcripts
and using the default as the base for the filename. However, the search terms
will be interpreted as the base for the filenames (and, since they are not a
valid identifier, an error is returned). To prevent this, you can either
specify the value explicitly:
.Sp
.Vb 2
\&  bp_genbank_ref_extractor \-\-transcripts=\*(Aqaccession\*(Aq \e
\&        \*(AqH2A AND homo sapiens\*(Aq
.Ve
.Sp
or you can use the double dash to stop processing options. Note that this
should only be used after the last option.
All arguments supplied after the double dash will be interpreted as search
terms:
.Sp
.Vb 2
\&  bp_genbank_ref_extractor \-\-transcripts \e
\&        \-\- \*(AqH2A AND homo sapiens\*(Aq
.Ve
.SH "NOTES ON USAGE"
.IX Header "NOTES ON USAGE"
.IP "\(bu" 4
Genes that are marked as 'live' and 'protein\-coding' should have at least one
transcript. However, this is not always true due to annotation mistakes. Such
cases will throw a warning. When faced with this, be nice and write to the
Entrez RefSeq maintainers.
.IP "\(bu" 4
When creating the directories to save the files, if the directory already
exists it will be used, and no error or warning will be issued unless
\&\fB\-\-debug\fR has been set. If a non-directory file already exists with that
name, bp_genbank_ref_extractor exits with an error.
.IP "\(bu" 4
On the subject of verbosity, all messages are saved in the log file. The
options \fB\-\-verbose\fR and \fB\-\-very\-verbose\fR only affect their printing to
standard output. Debug messages are different, as they will only show up (and
be logged) if requested with \fB\-\-debug\fR.
.IP "\(bu" 4
When saving a file, to avoid problems with limited filesystems such as \s-1NTFS\s0
or \s-1FAT,\s0 only some characters are allowed. All other characters will be
replaced by an underscore. Allowed characters are:
.Sp
\&\fBa\-z 0\-9 \- + . , () {} []'\fR
.IP "\(bu" 4
\&\fBbp_genbank_ref_extractor\fR tries to use the same file extensions that
bioperl would expect when saving the file. If it is unable to, it will use
the '.seq' extension.
.SH "FEEDBACK"
.IX Header "FEEDBACK"
.SS "Mailing lists"
.IX Subsection "Mailing lists"
User feedback is an integral part of the evolution of this and other Bioperl
modules. Send your comments and suggestions preferably to the Bioperl mailing
list. Your participation is much appreciated.
.PP
.Vb 2
\&  bioperl\-l@bioperl.org               \- General discussion
\&  https://bioperl.org/Support.html    \- About the mailing lists
.Ve
.SS "Support"
.IX Subsection "Support"
Please direct usage questions or support issues to the mailing list:
\&\fIbioperl\-l@bioperl.org\fR
rather than to the module maintainer directly. Many experienced and
responsive experts will be able to look at the problem and quickly address it.
Please include a thorough description of the problem, with code and data
examples if at all possible.
.SS "Reporting bugs"
.IX Subsection "Reporting bugs"
Report bugs to the Bioperl bug tracking system to help us keep track of the
bugs and their resolution. Bug reports can be submitted via the web:
.PP
.Vb 1
\&  https://github.com/bioperl/bio\-eutilities/issues
.Ve
.SH "AUTHOR"
.IX Header "AUTHOR"
Carnë Draug
.SH "COPYRIGHT"
.IX Header "COPYRIGHT"
This software is copyright (c) 2011\-2015 by Carnë Draug.
.PP
This software is available under the \s-1GNU\s0 General Public License, Version 3, June 2007.