Utilities for short tandem repeats (STRs)
str-analysis
This package contains scripts and utilities for analyzing short tandem repeats (STRs).
Install
To install the scripts and utilities, run:
python3 -m pip install --upgrade str_analysis
Alternatively, you can use this docker image:
docker run -it weisburd/str-analysis:latest
call_non_ref_pathogenic_motifs
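This script takes a bam or cram file and determines which motifs are present at known pathogenic STR loci (such as RFC1, BEAN1, DAB1, etc.) where several motifs are known to segregate in the population. It can then optionally run ExpansionHunterDenovo, ExpansionHunter, and/or STRling and gather the relevant fields from their outputs, which the user can then compare or use in downstream analyses. It can also run REViewer to generate read visualization images based on the ExpansionHunter output. Finally, it generates one json file per locus that contains all the collected information, as well as a "call" field that indicates whether pathogenic motifs were detected.
Example command lines: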
# basic command
call_non_ref_pathogenic_motifs -R hg38.fasta -g 38 sample1.cram --locus RFC1
# run ExpansionHunter and REViewer on all 9 loci with known non-ref pathogenic motifs
call_non_ref_pathogenic_motifs -R hg38.fasta --run-expansion-hunter --run-reviewer -g 38 sample1.cram --all-loci
# for 2 specific loci, run ExpansionHunter and REViewer + provide an existing ExpansionHunterDenovo profile
call_non_ref_pathogenic_motifs -R hg38.fasta --run-expansion-hunter --run-reviewer --ehdn-profile sample1.str_profile.json -g 38 sample1.cram --locus RFC1 --locus BEAN1
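When the same command needs to run over several samples, the argument list can be assembled in Python. The following is a minimal sketch and not part of str_analysis: the helper name `build_command` and the sample paths are hypothetical, and it uses only long-form flags that appear in the help text (`--reference-fasta`, `--genome-version`, `--run-expansion-hunter`, `--run-reviewer`, `--locus`).

```python
import shlex


def build_command(cram_path, reference_fasta, loci, genome_version="38",
                  run_expansion_hunter=False, run_reviewer=False):
    """Assemble a call_non_ref_pathogenic_motifs argument list (hypothetical helper)."""
    if run_reviewer and not run_expansion_hunter:
        # The help text states that --run-reviewer requires --run-expansion-hunter.
        raise ValueError("--run-reviewer requires --run-expansion-hunter")
    cmd = [
        "call_non_ref_pathogenic_motifs",
        "--reference-fasta", reference_fasta,
        "--genome-version", genome_version,
    ]
    if run_expansion_hunter:
        cmd.append("--run-expansion-hunter")
    if run_reviewer:
        cmd.append("--run-reviewer")
    cmd.append(cram_path)
    for locus in loci:
        # --locus can be specified more than once to call multiple loci.
        cmd += ["--locus", locus]
    return cmd


if __name__ == "__main__":
    for cram in ["sample1.cram", "sample2.cram"]:  # hypothetical input paths
        cmd = build_command(cram, "hg38.fasta", ["RFC1", "BEAN1"],
                            run_expansion_hunter=True, run_reviewer=True)
        # Print the shell-quoted command instead of executing it.
        print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` once the script is installed and on the PATH.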
Command line args:
positional arguments:
bam_or_cram_path bam or cram path.
optional arguments:
-h, --help show this help message and exit
-g {GRCh37,hg19,hg37,37,GRCh38,hg38,38}, --genome-version {GRCh37,hg19,hg37,37,GRCh38,hg38,38}
Reference genome version
-r REFERENCE_FASTA, --reference-fasta REFERENCE_FASTA
Reference fasta path.
-o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
Output filename prefix.
-s SAMPLE_ID, --sample-id SAMPLE_ID
The sample id to put in the output json file. If not
specified, it will be retrieved from the bam/cram
header or filename prefix.
--strling-genotype-table STRLING_GENOTYPE_TABLE
Optionally provide an existing STRling output file for
this sample. If specified, the script will skip
running STRling.
--run-strling Optionally run STRling and copy information relevant
to the locus from the STRling results to the json
output file.
--strling-path STRLING_PATH
The path of the STRling executable to use.
--strling-reference-index STRLING_REFERENCE_INDEX
Optionally provide the path of a pre-computed STRling
reference index file. If provided, it will save a step
and allow STRling to complete faster.
--expansion-hunter-denovo-profile EXPANSION_HUNTER_DENOVO_PROFILE
Optionally copy information relevant to the locus from
this ExpansionHunterDenovo profile to the output json.
This is instead of --run-expansion-hunter-denovo.
--run-expansion-hunter-denovo
Optionally run ExpansionHunterDenovo and copy
information relevant to the locus from
ExpansionHunterDenovo results to the output json.
--expansion-hunter-denovo-path EXPANSION_HUNTER_DENOVO_PATH
The path of the ExpansionHunterDenovo executable to
use if --run-expansion-hunter-denovo is specified.
--run-expansion-hunter
If this option is specified, this script will run
ExpansionHunter once for each of the motif(s) it
detects at the locus. ExpansionHunter doesn't
currently support genotyping multiallelic repeats such
as RFC1 where an individual may have 2 alleles with
motifs that differ from each other (and from the
reference motif). Running ExpansionHunter separately
for each motif provides a work-around.
--expansion-hunter-path EXPANSION_HUNTER_PATH
The path of the ExpansionHunter executable to use if
--run-expansion-hunter is specified. This must be
ExpansionHunter version 3 or greater.
--use-offtarget-regions
Optionally use off-target regions when counting reads
that support a motif, and when running
ExpansionHunter.
--run-reviewer Run the REViewer tool to visualize ExpansionHunter
output. --run-expansion-hunter must also be specified.
--run-reviewer-for-pathogenic-calls
Run the REViewer tool to visualize ExpansionHunter
output only when this script calls a sample as having
PATHOGENIC MOTIF / PATHOGENIC MOTIF. --run-expansion-
hunter must also be specified.
--all-loci Generate calls for all these loci: RFC1, BEAN1, DAB1,
MARCHF6, RAPGEF2, SAMD12, STARD7, TNRC6A, YEATS2
-l {RFC1,BEAN1,DAB1,MARCHF6,RAPGEF2,SAMD12,STARD7,TNRC6A,YEATS2}, --locus {RFC1,BEAN1,DAB1,MARCHF6,RAPGEF2,SAMD12,STARD7,TNRC6A,YEATS2}
Generate calls for this specific locus. This argument
can be specified more than once to call multiple loci.
-v, --verbose Print detailed log messages.
Command line args:
positional arguments:
bam_or_cram_path bam or cram path
optional arguments:
-h, --help show this help message and exit
-g {GRCh37,hg19,hg37,37,GRCh38,hg38,38}, --genome-version {GRCh37,hg19,hg37,37,GRCh38,hg38,38}
-R REFERENCE_FASTA, --reference-fasta REFERENCE_FASTA
Reference fasta path. The reference fasta is sometimes
necessary for decoding cram files.
-o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
Output filename prefix
-s SAMPLE_ID, --sample-id SAMPLE_ID
The sample id to put in the output json file. If not
specified, it will be retrieved from the bam/cram
header or filename prefix.
--run-expansion-hunter-denovo
Optionally run ExpansionHunterDenovo and copy
information relevant to the locus from
ExpansionHunterDenovo results to the output json.
--expansion-hunter-denovo-path EXPANSION_HUNTER_DENOVO_PATH
The path of the ExpansionHunterDenovo executable to
use when --run-expansion-hunter-denovo is specified.
--expansion-hunter-denovo-profile EXPANSION_HUNTER_DENOVO_PROFILE
Optionally copy information relevant to the locus from
this ExpansionHunterDenovo profile to the output json.
This is instead of --run-expansion-hunter-denovo
-r, --run-expansion-hunter
If this option is specified, this script will run
ExpansionHunter once for each of the motif(s) it
detects at the locus. ExpansionHunter doesn't
currently support genotyping multiallelic repeats such
as RFC1 where an individual may have 2 alleles with
motifs that differ from each other (and from the
reference motif). Running ExpansionHunter separately
for each motif provides a work-around.
--expansion-hunter-path EXPANSION_HUNTER_PATH
The path of the ExpansionHunter executable to use if
-r is specified. This must be ExpansionHunter version
3 or greater.
--use-offtarget-regions
Optionally use off-target regions when counting reads
that support a motif, and when running ExpansionHunter.
--all-loci Generate calls for all these loci: RFC1, BEAN1, DAB1,
MARCHF6, RAPGEF2, SAMD12, STARD7, TNRC6A, YEATS2
-l {RFC1,BEAN1,DAB1,MARCHF6,RAPGEF2,SAMD12,STARD7,TNRC6A,YEATS2}, --locus {RFC1,BEAN1,DAB1,MARCHF6,RAPGEF2,SAMD12,STARD7,TNRC6A,YEATS2}
Generate calls for this specific locus. This argument
can be specified more than once to call multiple loci.
--run-reviewer Run the REViewer tool to visualize ExpansionHunter
output. --run-expansion-hunter must also be specified.
--run-reviewer-for-pathogenic-calls
Run the REViewer tool to visualize ExpansionHunter
output only when this script calls a sample as having
PATHOGENIC MOTIF / PATHOGENIC MOTIF. --run-expansion-
hunter must also be specified.
-v, --verbose Print detailed log messages
*_motifs.json
Output file summary:
The call_non_ref_pathogenic_motifs script outputs a .json file with many fields that summarize what it found, as well as the ExpansionHunter, REViewer, ExpansionHunterDenovo, and STRling outputs when the --run-expansion-hunter, --run-reviewer, --run-expansion-hunter-denovo, and/or --run-strling args are used.
The key fields are:
call: (e.g. BENIGN MOTIF / PATHOGENIC MOTIF)
motif1_repeat_unit: (e.g. AAAGG)
motif2_repeat_unit: (e.g. AAGGG)
expansion_hunter_call_repeat_unit: (e.g. AAAGG / AAGGG) The same information as above, but in genotype form.
expansion_hunter_call_genotype: (e.g. 15/68) The number of repeats of each of the motifs above.
expansion_hunter_call_CI: (e.g. 15-15/55-87) The confidence intervals for the genotype.
expansion_hunter_call_reviewer_svg: (e.g. sample1.RFC1_AAGGG.expansion_hunter_reviewer.svg) The REViewer read visualization image path.
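The key fields can be pulled out of an output file with a few lines of Python. This is a minimal sketch under stated assumptions: it assumes the *_motifs.json file holds a single top-level JSON object keyed by field name, and the helper names `summarize_key_fields` and `load_key_fields` are hypothetical. Fields that are absent (for example the expansion_hunter_* fields when ExpansionHunter was not run) come back as None.

```python
import json

# Key fields as named in the output description; the expansion_hunter_* ones
# are only expected when --run-expansion-hunter / --run-reviewer were used.
KEY_FIELDS = [
    "call",
    "motif1_repeat_unit",
    "motif2_repeat_unit",
    "expansion_hunter_call_repeat_unit",
    "expansion_hunter_call_genotype",
    "expansion_hunter_call_CI",
    "expansion_hunter_call_reviewer_svg",
]


def summarize_key_fields(record):
    """Return the key fields of one parsed output record, using None for absent fields."""
    return {field: record.get(field) for field in KEY_FIELDS}


def load_key_fields(json_path):
    """Read one output json file (assumed to be a single JSON object) and summarize it."""
    with open(json_path) as f:
        return summarize_key_fields(json.load(f))
```

For example, `load_key_fields(path)["call"]` returns the call string for the output file at `path`.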
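Output fields:
sample_id: If this value is not specified as a command line arg, it is parsed from the input bam/cram file header or filename prefix.
call: Describes the motifs detected at the RFC1/CANVAS locus. Its format is analogous to a VCF genotype. Possible values are:
PATHOGENIC MOTIF / PATHOGENIC MOTIF: only pathogenic motif(s) detected
BENIGN MOTIF / BENIGN MOTIF: only benign motif(s) detected
MOTIF OF UNCERTAIN SIGNIFICANCE / MOTIF OF UNCERTAIN SIGNIFICANCE: non-canonical motif(s) with unknown pathogenicity detected
BENIGN MOTIF / PATHOGENIC MOTIF: heterozygous for a benign motif and a pathogenic motif, implying carrier status
PATHOGENIC MOTIF / MOTIF OF UNCERTAIN SIGNIFICANCE: heterozygous for a pathogenic motif and a non-canonical motif with unknown pathogenicity
BENIGN MOTIF / MOTIF OF UNCERTAIN SIGNIFICANCE: heterozygous for a benign motif and a non-canonical motif with unknown pathogenicity
NO CALL: not enough evidence in the read data to support any of the above options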
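motif1_repeat_unit: The repeat unit that is supported by the most reads.
motif1_read_count: The number of reads that support motif1.
motif1_normalized_read_count: The same as motif1_read_count, but normalized by the depth of coverage in the regions flanking the RFC1 locus.
motif1_n_occurrences: The total number of times motif1 occurs in the reads at the RFC1 locus.
motif1_read_count_with_offtargets: The number of reads that support motif1 within the off-target regions for this repeat unit. These are regions of approximately 1kb where, based on experiments with simulated data, fully-repetitive (aka IRR) reads may mis-align.
motif1_normalized_read_count_with_offtargets: The same as motif1_read_count_with_offtargets, but normalized by the depth of coverage in the regions flanking the RFC1 locus.
motif2_repeat_unit: If more than one motif is detected, this field contains the repeat unit with the next-most read support.
motif2_read_count: See the "motif1_read_count" description.
motif2_n_occurrences: See the "motif1_n_occurrences" description.
...
NOTE: the motif2_* fields are only generated when the reads support more than one motif.
left_flank_coverage: The average read depth within a 2kb window immediately to the left of the RFC1 locus.
right_flank_coverage: The average read depth within a 2kb window immediately to the right of the RFC1 locus.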
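found_n_reads_overlap_rfc1_locus: The number of reads that overlap the AAAAG repeat in the reference genome at the RFC1 locus and have MAPQ > 2.
found_repeats_in_n_reads: The number of those reads that also contain some 5bp or 6bp repeat motif covering > 70% of the overlapping read sequence (including any soft-clipped bases).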
found_repeats_in_fraction_of_reads: found_repeats_in_n_reads / found_n_reads_overlap_rfc1_locus
NOTE: the expansion_hunter_* fields below are added when --run-expansion-hunter is used.
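expansion_hunter_motif1_json_output_file: The path of the ExpansionHunter output json file from the run on motif1.
expansion_hunter_motif1_repeat_unit: The repeat unit passed to ExpansionHunter.
expansion_hunter_motif1_short_allele_genotype: The genotype (number of repeats) of the short allele in the ExpansionHunter output.
expansion_hunter_motif1_long_allele_genotype: The genotype (number of repeats) of the long allele in the ExpansionHunter output.
expansion_hunter_motif1_short_allele_CI_start: The lower bound of the genotype confidence interval for the short allele in the ExpansionHunter output.
expansion_hunter_motif1_short_allele_CI_end: The upper bound of the genotype confidence interval for the short allele in the ExpansionHunter output.
expansion_hunter_motif1_long_allele_CI_start: The lower bound of the genotype confidence interval for the long allele in the ExpansionHunter output.
expansion_hunter_motif1_long_allele_CI_end: The upper bound of the genotype confidence interval for the long allele in the ExpansionHunter output.
expansion_hunter_motif1_total_spanning_reads: The total number of spanning reads that ExpansionHunter reports as supporting the motif1 genotype.
expansion_hunter_motif1_total_flanking_reads: The total number of flanking reads that ExpansionHunter reports as supporting the motif1 genotype.
expansion_hunter_motif1_total_inrepeat_reads: The total number of IRR reads that ExpansionHunter reports as supporting the motif1 genotype.
expansion_hunter_motif2_json_output_file: If a second motif was detected at the locus, this is the path of the ExpansionHunter output json file from the run on the second motif. All the fields listed above for motif1 are also present for motif2.
...
expansion_hunter_call_repeat_unit: The repeat unit(s) used to run ExpansionHunter. If more than one was detected, the format is AAAAG / AAGGG.
expansion_hunter_call_genotype: The combined genotype based on the results of the ExpansionHunter run(s).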
expansion_hunter_call_CI: The confidence intervals for the combined genotype.
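NOTE: the *_reviewer_svg fields below are only added when --run-reviewer is used.
expansion_hunter_motif1_reviewer_svg: The path of the .svg image file that REViewer generated for this motif.
expansion_hunter_motif2_reviewer_svg: If a second motif was detected at the locus, this is the path of the .svg image file that REViewer generated for that other motif.
expansion_hunter_call_reviewer_svg: The final .svg image file. If more than one motif was detected, this contains a merged image with 1 short-allele panel and 1 long-allele panel selected from the motif1 and motif2 images above.
NOTE: the ehdn_* fields below are only added when --run-expansion-hunter-denovo is used.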
ehdn_motif1_repeat_unit
ehdn_motif1_anchored_irr_count
ehdn_motif1_paired_irr_count
ehdn_motif1_total_irr_count
ehdn_sample_read_depth
ehdn_motif1_n_anchored_regions
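Because the call field is formatted like a VCF genotype, downstream code can split it into two allele labels. The sketch below is one possible way to do that and is not part of str_analysis: the function names `parse_call` and `count_pathogenic_alleles` are hypothetical, and the set of labels is taken from the possible call values listed in the output fields.

```python
# Allele labels that appear in the documented call values.
ALLELE_LABELS = {
    "PATHOGENIC MOTIF",
    "BENIGN MOTIF",
    "MOTIF OF UNCERTAIN SIGNIFICANCE",
}


def parse_call(call):
    """Split a call such as 'BENIGN MOTIF / PATHOGENIC MOTIF' into two allele labels.

    Returns None for 'NO CALL' and raises ValueError for any other unexpected value.
    """
    if call.strip() == "NO CALL":
        return None
    labels = tuple(part.strip() for part in call.split("/"))
    if len(labels) != 2 or any(label not in ALLELE_LABELS for label in labels):
        raise ValueError(f"Unexpected call value: {call!r}")
    return labels


def count_pathogenic_alleles(call):
    """Return 0, 1, or 2 pathogenic allele labels, or None for 'NO CALL'."""
    labels = parse_call(call)
    if labels is None:
        return None
    return sum(label == "PATHOGENIC MOTIF" for label in labels)
```

A count of 1 corresponds to the heterozygous calls that include a pathogenic motif, and a count of 2 to PATHOGENIC MOTIF / PATHOGENIC MOTIF.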
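Example output:
Assume the script detects that sample1 contains two sets of reads at the RFC1 locus: some with the AAGGG motif and some with AAAAG. The script then runs ExpansionHunter for the AAGGG motif, and then runs it again for the AAAAG motif. Say ExpansionHunter outputs 15/73 as the genotype for AAGGG and 15/22 as the genotype for AAAAG. The script would then output:
call: (BENIGN MOTIF / PATHOGENIC MOTIF)
expansion_hunter_call_repeat_unit: (AAAAG / AAGGG)
expansion_hunter_call_genotype: (15/73) Selected from the two genotypes above.
expansion_hunter_call_CI: (15-15/55-87) Selected from the two sets of confidence intervals.
expansion_hunter_call_reviewer_svg: (sample1.RFC1_AAGGG.expansion_hunter_reviewer.svg)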
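The genotype and confidence-interval strings in the example can be converted to integers for filtering or plotting. This is a minimal sketch, assuming both strings always describe exactly two alleles in the formats shown in the example (15/73 and 15-15/55-87); the function names are hypothetical and are not part of str_analysis.

```python
def parse_genotype(genotype):
    """Parse a genotype string such as '15/73' into (short_allele, long_allele)."""
    short_allele, long_allele = genotype.split("/")
    return int(short_allele), int(long_allele)


def parse_confidence_intervals(ci):
    """Parse a CI string such as '15-15/55-87' into ((15, 15), (55, 87))."""
    intervals = []
    for allele_ci in ci.split("/"):
        start, end = allele_ci.split("-")
        intervals.append((int(start), int(end)))
    return tuple(intervals)


def genotype_within_ci(genotype, ci):
    """Check that each allele size lies inside its own confidence interval."""
    alleles = parse_genotype(genotype)
    intervals = parse_confidence_intervals(ci)
    return all(start <= n <= end for n, (start, end) in zip(alleles, intervals))


if __name__ == "__main__":
    print(parse_genotype("15/73"))
    print(parse_confidence_intervals("15-15/55-87"))
    print(genotype_within_ci("15/73", "15-15/55-87"))
```

A consistency check like `genotype_within_ci` is only a sanity test on the parsed values; it says nothing about whether the genotype itself is accurate.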
Hashes for str_analysis-0.9.7.tar.gz
Algorithm | Hash digest
---|---
SHA256 | e7b3963bf64405cf789199037868bc55aedd7ff90adfe05cc2dc12872329f16b
MD5 | 92e56d5ccd3b4c7fd4ce443a9fa2e192
BLAKE2b-256 | 1a51797366d79e10788dea9a869ec3cc8b9a9dfdd2975976bdc2432180e5d463
Hashes for str_analysis-0.9.7-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 4a9bcd745ab5439a203919776fe5bc4ba9f6c6973152be36f97e309f9291787b
MD5 | 202f7a932cfd650b01316b223984c975
BLAKE2b-256 | 89c2afdc7eea216ff2b4e69ee55d741b43e6b8df6a6ab92e3385fd3f2319f31f