Extract metadata and URLs from PDF files

Project description

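Introduction

Scans PDFs for links written in plain text and checks whether they are active or return an error code, then generates a report of its findings. It also extracts references (PDF, URL, DOI, arXiv) and metadata from a PDF.

Features

Extract references and metadata from a given PDF.
Detect PDF, URL, arXiv and DOI references.
Check for valid SSL certificates.
Find broken hyperlinks (using the -c flag).
Output as text or JSON (using the -j flag).
Extract the PDF text (using the --text flag).
Use as a command-line tool or Python package.
Works with local and online PDFs.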

Installation

Grab a copy of the code with pip:

pip install linkrot

Usage

linkrot can be used to extract information from a PDF in two ways:

Command line/terminal tool: linkrot
Python library: import linkrot

1. Command line/terminal tool

linkrot [pdf-file-or-url]

Run linkrot -h to see the help output:

linkrot -h

usage:

linkrot [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE] [--version] pdf

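Extract metadata and references from a PDF, and optionally download all referenced PDFs.

Arguments

positional arguments:

pdf (Filename or URL of a PDF file)

optional arguments: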

-h, --help            (Show this help message and exit)  
-d OUTPUT_DIRECTORY,  --download-pdfs OUTPUT_DIRECTORY (Download all referenced PDFs into specified directory)  
-c, --check-links     (Check for broken links)  
-j, --json            (Output infos as JSON (instead of plain text))  
-v, --verbose         (Print all references (instead of only PDFs))  
-t, --text            (Only extract text (no metadata or references))  
-o OUTPUT_FILE,        --output-file OUTPUT_FILE (Output to specified file instead of console)  
--version             (Show program's version number and exit)

Examples

Extract text to console

linkrot https://example.com/example.pdf -t

Extract text to file

linkrot https://example.com/example.pdf -t -o pdf-text.txt

Check links

linkrot https://example.com/example.pdf -c
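The command-line examples above can also be driven from a script. The following is a minimal sketch that only assembles the argument list from the flags documented in the help output (-c, -j, -t, -o); the helper name build_linkrot_command is hypothetical and not part of linkrot. Running the command is left commented out because it needs linkrot installed and a reachable PDF.

```python
import shlex
from typing import List, Optional


def build_linkrot_command(
    pdf: str,
    check_links: bool = False,
    as_json: bool = False,
    text_only: bool = False,
    output_file: Optional[str] = None,
) -> List[str]:
    """Assemble a linkrot command line (hypothetical helper, not part of linkrot)."""
    command = ["linkrot", pdf]
    if check_links:
        command.append("-c")  # check for broken links
    if as_json:
        command.append("-j")  # output as JSON instead of plain text
    if text_only:
        command.append("-t")  # only extract text
    if output_file is not None:
        command += ["-o", output_file]  # write to a file instead of the console
    return command


command = build_linkrot_command(
    "https://example.com/example.pdf", text_only=True, output_file="pdf-text.txt"
)
print(shlex.join(command))
# To actually run it (requires linkrot on PATH):
# import subprocess; subprocess.run(command, check=True)
```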

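2. Main Python library

Import the library:

import linkrot

Create an instance of the linkrot class like so:

pdf = linkrot.linkrot("filename-or-url.pdf") # pdf is the instance of the linkrot class

Now the following functions can be used to extract specific data from the PDF:

get_metadata()

Arguments: None

Usage:

metadata = pdf.get_metadata() # pdf is the instance of the linkrot class

Return type: Dictionary <class 'dict'>

Information provided: All metadata, including hidden metadata, associated with the PDF, such as the creation date, creator and title.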
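Since get_metadata() returns a plain dictionary, it can be listed with ordinary dict handling. The sketch below is illustrative only: format_metadata is a hypothetical helper, and the sample dictionary is a stand-in, because the exact key names linkrot produces are not listed here.

```python
from typing import Any, Dict, List


def format_metadata(metadata: Dict[Any, Any]) -> List[str]:
    """Return one 'key: value' line per entry, ordered by key (hypothetical helper)."""
    return [f"{key}: {metadata[key]}" for key in sorted(metadata, key=str)]


# Stand-in data; with linkrot installed this would be pdf.get_metadata().
# The key names here are assumptions chosen for illustration.
sample_metadata = {"Title": "Example", "Creator": "LaTeX"}
for line in format_metadata(sample_metadata):
    print(line)
```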

get_text()

Arguments: None

Usage:

text = pdf.get_text() # pdf is the instance of the linkrot class

Return type: String <class 'str'>

Information provided: The entire content of the PDF as a string.

get_references(reftype=None, sort=False)

Arguments:

reftype: The type of reference that is needed
      values: 'pdf', 'url', 'doi', 'arxiv'.
      default: Provides all reference types.

sort: Whether references should be sorted or not
      values: True or False.
      default: Is not sorted.

Usage:

references_list = pdf.get_references() # pdf is the instance of the linkrot class

Return type: Set <class 'set'> of <linkrot.backends.Reference object>

linkrot.backends.Reference object has 3 member variables:
- ref: actual URL/PDF/DOI/ARXIV
- reftype: type of reference
- page: page on which it was referenced

Information provided: All references with their corresponding type and page number.
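The three member variables described above (ref, reftype, page) are enough to regroup the returned set, for example by page. This is a sketch under the assumption that each object exposes exactly those attributes; group_refs_by_page is a hypothetical helper, and SimpleNamespace objects stand in for real Reference objects so the example runs without a PDF.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def group_refs_by_page(references: Iterable[Any]) -> Dict[Any, List[str]]:
    """Map each page to the sorted list of references found on it."""
    pages: Dict[Any, List[str]] = defaultdict(list)
    for reference in references:
        pages[reference.page].append(reference.ref)
    return {page: sorted(refs) for page, refs in pages.items()}


# Stand-ins for linkrot.backends.Reference objects (illustrative values).
stand_in_refs = [
    SimpleNamespace(ref="https://example.com/b", reftype="url", page=2),
    SimpleNamespace(ref="10.1000/xyz123", reftype="doi", page=1),
    SimpleNamespace(ref="https://example.com/a", reftype="url", page=2),
]
for page, refs in sorted(group_refs_by_page(stand_in_refs).items()):
    print(page, refs)
```

With linkrot installed, the same helper could be called as group_refs_by_page(pdf.get_references()).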

get_references_as_dict(reftype=None, sort=False)

Arguments:

reftype: The type of reference that is needed
      values: 'pdf', 'url', 'doi', 'arxiv'.
      default: Provides all reference types.

sort: Whether references should be sorted or not
      values: True or False.
      default: Is not sorted.

Usage:

references_dict = pdf.get_references_as_dict() # pdf is the instance of the linkrot class

Return type: Dictionary <class 'dict'> with keys 'pdf', 'url', 'doi', 'arxiv', each holding a <class 'list'> of references of that type.

Information provided: All references in their corresponding type list.
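Because the result is a dictionary of lists keyed by reference type, a per-type tally takes one line. The sketch below uses a hypothetical helper, count_references, and a hand-written stand-in dictionary with the key layout described above.

```python
from typing import Dict, List


def count_references(references_dict: Dict[str, List[str]]) -> Dict[str, int]:
    """Count how many references were found for each reference type."""
    return {reftype: len(refs) for reftype, refs in references_dict.items()}


# Stand-in for pdf.get_references_as_dict(); values are illustrative.
sample_references = {
    "pdf": ["https://example.com/paper.pdf"],
    "url": ["https://example.com", "https://example.org"],
    "doi": [],
    "arxiv": ["1706.03762"],
}
print(count_references(sample_references))
```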

download_pdfs(target_dir)

Arguments:

target_dir: The path of the directory to which the referenced PDFs should be downloaded

Usage:

pdf.download_pdfs("target-directory") # pdf is the instance of the linkrot class

Return type: None

Information provided: Downloads all referenced PDFs to the specified directory.

3. Linkrot downloader functions

Import:

from linkrot.downloader import sanitize_url, get_status_code, check_refs

sanitize_url(url)

Arguments:

url: The URL to be sanitized.

Usage:

new_url = sanitize_url(old_url)

Return type: String <class 'str'>

Information provided: The URL is prefixed with "http://" if it was not already prefixed, and is ensured to be in UTF-8 format.
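To make the described behaviour concrete, here is a small stand-alone sketch of the same idea. It is not linkrot's implementation: sketch_sanitize_url is a hypothetical function, and treating "://" as the test for an existing prefix is an assumption.

```python
def sketch_sanitize_url(url: str) -> str:
    """Illustrative stand-in for the behaviour described above (not linkrot's code)."""
    # Assumption: a URL counts as already prefixed when it contains a scheme separator.
    if "://" not in url:
        url = "http://" + url
    # Round-trip through UTF-8, dropping anything that cannot be encoded.
    return url.encode("utf-8", errors="ignore").decode("utf-8")


print(sketch_sanitize_url("example.com/paper.pdf"))
print(sketch_sanitize_url("https://example.com/paper.pdf"))
```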

get_status_code(url)

Arguments:

url: The URL to be checked for its status.

Usage:

status_code = get_status_code(url)

Return type: String <class 'str'>

Information provided: Checks whether the URL is active or broken.
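The returned status can be used to sort URLs into active and broken groups. The sketch below assumes that a status of "200" means the link is active and that any other value means it is broken; that mapping is an assumption, as is the helper naming (is_active, split_by_status). The statuses are hard-coded stand-ins so the example runs offline.

```python
from typing import Dict, List, Tuple


def is_active(status_code: object) -> bool:
    """Assumption: only a status of 200 counts as an active link."""
    return str(status_code).strip() == "200"


def split_by_status(statuses: Dict[str, object]) -> Tuple[List[str], List[str]]:
    """Split a {url: status} mapping into (active, broken) URL lists."""
    active = sorted(url for url, code in statuses.items() if is_active(code))
    broken = sorted(url for url, code in statuses.items() if not is_active(code))
    return active, broken


# Stand-in results; with linkrot installed these would come from get_status_code(url).
sample_statuses = {
    "https://example.com": "200",
    "https://example.com/missing": "404",
}
print(split_by_status(sample_statuses))
```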

check_refs(refs, verbose=True, max_threads=MAX_THREADS_DEFAULT)

Arguments:

refs: set of linkrot.backends.Reference objects
verbose: whether it should print every reference with its code or just the summary of the link checker
max_threads: number of threads for multithreading

Usage:

check_refs(pdf.get_references()) # pdf is the instance of the linkrot class

Return type: None

Information provided: Prints the references with their status codes, plus a summary of all broken/active links, to the terminal.

4. Linkrot extractor functions

Import:

from linkrot.extractor import extract_urls, extract_doi, extract_arxiv

Get the PDF text:

text = pdf.get_text() # pdf is the instance of the linkrot class

extract_urls(text)

Arguments:

text: String of text to extract URLs from

Usage:

urls = extract_urls(text)

Return type: Set <class 'set'> of URLs

Information provided: All URLs in the text

extract_arxiv(text)

Arguments:

text: String of text to extract arXiv identifiers from

Usage:

arxiv = extract_arxiv(text)

Return type: Set <class 'set'> of arXiv identifiers

Information provided: All arXiv identifiers in the text

extract_doi(text)

Arguments:

text: String of text to extract DOIs from

Usage:

doi = extract_doi(text)

Return type: Set <class 'set'> of DOIs

Information provided: All DOIs in the text
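For readers curious how plain-text extraction of this kind can work, the following is a rough regex-based sketch. The patterns and function names (sketch_extract_doi, sketch_extract_arxiv) are hypothetical and deliberately simple; linkrot's own extractor may use different rules and return different results.

```python
import re
from typing import Set

# Hypothetical patterns for illustration only.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ARXIV_PATTERN = re.compile(r"arXiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def sketch_extract_doi(text: str) -> Set[str]:
    """Collect DOI-like strings, trimming trailing sentence punctuation."""
    return {match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)}


def sketch_extract_arxiv(text: str) -> Set[str]:
    """Collect new-style arXiv identifiers that follow an 'arXiv:' prefix."""
    return set(ARXIV_PATTERN.findall(text))


sample_text = "See doi:10.1000/xyz123. Also arXiv:1706.03762v5 and (10.5555/abc.def)."
print(sorted(sketch_extract_doi(sample_text)))
print(sorted(sketch_extract_arxiv(sample_text)))
```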

Code of Conduct

To view our code of conduct, please visit our Code of Conduct page.

License

This program is licensed under the MIT License.

Project details

Release history

3.9 (this version): Sep 25, 2022
3.8.8: Aug 2, 2022
3.8.7: Aug 2, 2022
3.8.6: Aug 2, 2022
3.8.5: Aug 2, 2022
3.8.4: Aug 1, 2022
3.8.3: Jul 31, 2022
3.8.2: Jul 31, 2022
3.8.1: Jul 31, 2022
3.8: Jul 31, 2022
3.7: Jul 30, 2022
3.6 (yanked): Jul 30, 2022
3.5: Jun 1, 2022
3.4: Dec 11, 2021
3.3: Dec 11, 2021
3.2: Dec 8, 2021
3.1: Dec 8, 2021
3.0: Nov 28, 2021
2.92: Nov 28, 2021
2.91: Nov 25, 2021
2.9: Nov 23, 2021
2.8: Nov 23, 2021
2.7: Nov 22, 2021
2.6: Nov 14, 2021
2.3: Oct 24, 2021
2.2: Oct 11, 2021
2.1.1: Oct 11, 2021
2.1: Oct 11, 2021
2.0.1: Oct 5, 2021
2.0: Oct 3, 2021
1.11: Sep 6, 2021
1.1: Sep 5, 2021
1.0: Aug 22, 2021
0.999: Aug 22, 2021
0.986 (yanked): Aug 16, 2021
0.99: Aug 16, 2021
0.0.1: Aug 13, 2021

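Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source distribution

linkrot-3.9.tar.gz (18.3 kB)

Uploaded Sep 25, 2022 source

Built distribution

linkrot-3.9-py3-none-any.whl (18.6 kB)

Uploaded Sep 25, 2022 py3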

Hashes for linkrot-3.9.tar.gz
Algorithm	Hash digest
SHA256	`e5ec84f3968f9afab2af0fe83ea463dc3dcef72fc595d0b4b7524c8740f3329d`
MD5	`04d826952f2b78593dbf3bb6370cfdaa`
BLAKE2-256	`b16d35a67610f05a3ced82c7c5ae65950392a93729bf07e8b4193cc33ecf9af9`

Hashes for linkrot-3.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`390683ff4ce749b11fd2c32f66b2228abc07a17e111cf7b23164e524d3f7786d`
MD5	`d685c1a38a7ede9f25498980b1fca0e0`
BLAKE2-256	`b9dc6bb13208da3efcaf5f6c39a79ffdc9a9baa1577a2023f5566018e065a005`
