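Automated Information Extraction System

This article covers OCR for text recognition, language models and NER for information extraction, and regular expressions/rules for matching specific data patterns and filling in forms.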

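1. Text Data Extraction

Formats: text-based PDFs, image-based PDFs, images

To extract text from these formats, I suggest the PyPDF2 package in Python. Note that PyPDF2 reads only a PDF's embedded text layer, so image-based PDFs and plain images need an OCR step first.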

# Import library
from PyPDF2 import PdfReader

# Open the PDF file (PdfReader accepts a path and handles the file itself)
pdf_file = PdfReader("data/sample.pdf")

# Extract the text of every page (treat a missing text layer as an empty string)
page_texts = [page.extract_text() or "" for page in pdf_file.pages]

# Join all the pages into a single string
text = '\n'.join(page_texts)

In this example, all the text is stored in a single string named "text".
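Pages that are scanned images have no text layer, so their extracted text comes back empty or nearly empty. The following is a minimal sketch of how such pages might be flagged for OCR. The helper name `pages_needing_ocr` and the `min_chars` threshold are assumptions made for illustration; they are not part of PyPDF2.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return indices of pages whose extracted text is too short to trust.

    page_texts: list of strings (or None), one entry per page.
    min_chars: hypothetical threshold; tune it for your own documents.
    """
    flagged = []
    for index, page_text in enumerate(page_texts):
        # Treat None as empty and ignore pure whitespace
        cleaned = (page_text or "").strip()
        if len(cleaned) < min_chars:
            flagged.append(index)
    return flagged


# Made-up example: pages 1 and 2 have no usable text layer
sample_pages = ["Tenant: John Smith\nPhone: 416-555-0199", "", "   \n"]
print(pages_needing_ocr(sample_pages))
```

The flagged pages could then be rendered to images and passed to an OCR engine of your choice.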

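2. Information Extraction

Suppose this demo only needs five fields from the data: first_name, last_name, address, phone, date_of_birth.

Regular Expression Approach

This traditional approach has drawbacks: it only works on structured input, and the keyword list can never cover every way a field might be labeled.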

# Import Regular Expression
import re

# Create empty lists to store our data
first_names = []
last_names = []
addresses = []
phones = []
dates_of_birth = []

# Define a function to capture the information from the text using regular expressions
def extract_info_1(text, first_names, last_names, addresses, phones, dates_of_birth):
    # Keywords that may label each field in the document
    address_keys = ["Location", "Located at", "Address", "Residence", "Premises", "Residential address"]
    dob_keys = ["Born on", "DOB", "Birth date", "Date of birth"]
    phone_keys = ["Phone", "Telephone", "Contact number", "Call at", "Phone number", "Mobile number"]
    first_name_keys = ["First name", "Given name", "First", "Given", "Tenant", "First and last name"]
    last_name_keys = ["Last name", "Family name", "Surname", "Last"]

    # Pair each keyword list with the output list that collects its matches
    fields = [
        (address_keys, addresses),
        (dob_keys, dates_of_birth),
        (phone_keys, phones),
        (first_name_keys, first_names),
        (last_name_keys, last_names),
    ]

    for keywords, results in fields:
        for keyword in keywords:
            # Match "<keyword> : <value>" and capture the rest of the line.
            # Overlapping keywords (e.g. "Address" and "Residential address")
            # can capture the same line twice.
            pattern = re.escape(keyword) + r"\s*:\s*(.*)"
            results.extend(re.findall(pattern, text, re.IGNORECASE))

# Apply function
extract_info_1(text, first_names, last_names, addresses, phones, dates_of_birth)
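To make the keyword pattern concrete, here is a small self-contained sketch run on a made-up snippet. It also shows one limitation of the approach: overlapping keywords such as "Address" and "Residential address" hit the same line, so the results may need de-duplication. The sample text and the `find_values` and `dedupe` helpers are illustrative assumptions.

```python
import re

sample_text = """First name: Kiel
Last name: Dang
Residential address: 50 Laughton Ave
Phone : 416-555-0199"""


def find_values(keywords, text):
    """Collect the text that follows '<keyword> :' for each keyword, in order."""
    values = []
    for keyword in keywords:
        pattern = re.escape(keyword) + r"\s*:\s*(.*)"
        values.extend(re.findall(pattern, text, re.IGNORECASE))
    return values


def dedupe(values):
    """Drop repeated values while keeping first-seen order."""
    return list(dict.fromkeys(value.strip() for value in values))


raw = find_values(["Address", "Residential address"], sample_text)
print(raw)          # both keywords capture the same line
print(dedupe(raw))  # one value remains
```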

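Named Entity Recognition Model

spaCy's NER (named entity recognition) component classifies entities into types so that different kinds of named entities in text can be identified and labeled. The entity types recognized by spaCy's English models include, but are not limited to:

PERSON, ORG, LOC, DATE, TIME, MONEY, PERCENT, QUANTITY, CARDINAL, PRODUCT, EVENT, LANGUAGE, LAW, WORK_OF_ART. PHONE is not a built-in label; it has to be added as a custom entity type.

GPE: geopolitical entities such as countries, cities, and states.

NORP: nationalities or religious or political groups.

FAC: names of facilities, buildings, or structures.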

import spacy

# Load the small English pipeline (en_core_web_sm), which includes an NER component
nlp = spacy.load("en_core_web_sm")

# Define a function to capture the information from text using Named Entity Recognition
def extract_info_3(text, first_names, last_names, addresses, phones, dates_of_birth):
    # Initialize variables to store extracted information
    first_name = ""
    last_name = ""
    address = ""
    phone = ""
    date_of_birth = ""

    # Process the text with spaCy NER model
    doc = nlp(text)

    # Extract the information using Named Entity Recognition
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            # Check if the entity text is a first name
            if not first_name:
                first_name = ent.text.strip()
            else:
                # If we already have a first name, assume the current entity is the last name
                last_name = ent.text.strip()
        elif ent.label_ == "GPE":
            # GPE covers countries, cities and states, so this captures
            # only part of a street address
            address = ent.text.strip()
        elif ent.label_ == "PHONE":
            # PHONE is not a built-in spaCy label; this branch only fires if a
            # custom component adds PHONE entities to the pipeline
            phone = ent.text.strip()
        elif ent.label_ == "DATE":
            # DATE entity type for dates, which could include date of birth
            date_of_birth = ent.text.strip()

    # Append the extracted information to the pre-defined lists
    first_names.append(first_name)
    last_names.append(last_name)
    addresses.append(address)
    phones.append(phone)
    dates_of_birth.append(date_of_birth)
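Because the stock English pipeline has no PHONE label, one simple fallback is to find phone numbers with a regular expression and leave names and dates to NER. The sketch below assumes North American-style numbers; the pattern and the `find_phone_numbers` helper are illustrative assumptions, not part of spaCy.

```python
import re

# Hypothetical pattern for numbers such as 416-555-0199, (416) 555-0199
# or 416.555.0199; adapt it for other national formats.
PHONE_PATTERN = re.compile(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b")


def find_phone_numbers(text):
    """Return all substrings of text that look like phone numbers."""
    return PHONE_PATTERN.findall(text)


sample = "Call at (416) 555-0199 or 647.555.0123 before 2024-01-23."
print(find_phone_numbers(sample))
```

The date at the end of the sample is not picked up, because it does not fit the 3-3-4 digit grouping.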

In-Context Learning Model

# 1. Import the necessary libraries
import openai
import os
import pandas as pd
import time

# 2. Set your OpenAI API key
openai.api_key = '<YOUR API KEY>'

# 3. Set up a function to get the result
# (openai.ChatCompletion is the interface of openai package versions before 1.0)
def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0,
        )
    return response.choices[0].message["content"]

# 4. Create a prompt from our instructions followed by the document text
question = "Read and understand the document, then help me extract 5 pieces of information including (1) First name, (2) last name, (3) date of birth, (4) address, (5) phone number of tenants only. Here is the content of the document: " + text

# 5. Query the API
response = get_completion(question)
print(response)
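A free-text reply is hard to feed into later steps. One option is to ask the model for JSON and parse it. The sketch below shows the idea with a made-up reply string standing in for the API response; the field keys and the `build_prompt` and `parse_reply` helpers are assumptions for illustration.

```python
import json

FIELDS = ["first_name", "last_name", "date_of_birth", "address", "phone"]


def build_prompt(document_text):
    """Build a prompt that asks for the five fields as one JSON object."""
    field_list = ", ".join(FIELDS)
    return (
        "Extract the following fields for the tenant only: "
        f"{field_list}. Reply with a single JSON object using exactly "
        "these keys, and use null for anything not found.\n\n"
        f"Document:\n{document_text}"
    )


def parse_reply(reply):
    """Parse the model reply; keys missing from the reply become None."""
    data = json.loads(reply)
    return {field: data.get(field) for field in FIELDS}


# A made-up reply, standing in for the string returned by the API call
fake_reply = '{"first_name": "Kiel", "last_name": "Dang", "phone": null}'
print(parse_reply(fake_reply))
```

A real reply is not guaranteed to be valid JSON, so in practice the `json.loads` call would need error handling.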

LayoutLMv3

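LayoutLMv3 is a pre-trained model, trained on a large dataset of 11 million document images and their corresponding text. The dataset spans many document types, including invoices, receipts, contracts, and medical records. For that reason, when fine-tuning we keep the weights of most layers frozen and unfreeze only the last few layers of the network. Training for longer yields little additional accuracy; what matters most is a good input dataset.

LayoutLMv3 is a powerful model that excels at understanding (1) a document's structure (layout positions) and (2) its content. Think of it as an advanced tool that not only recognizes words but also understands where they sit in the document and (3) how they relate to each other. This matters when handling varied document formats, including PDFs, images, and forms. LayoutLMv3 can be fine-tuned for document analysis tasks such as document classification, NER, and question answering.

The LayoutLMv3 architecture resembles the Transformer architecture, with a few added components:

Text embedding layer (document content): converts the document's text into numerical representations. This is the blue region in the architecture diagram.

Image embedding layer (position of the text/information): converts the document image into numerical representations. This is the orange region in the architecture diagram.

Cross-modal attention layer (relationship between text and its position): lets the model learn the relationships between the text and image embeddings. This is the purple region in the architecture diagram.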
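As a rough sense of scale for the two input streams, the sketch below counts how many image patches and text positions the model attends over. It uses the shapes that appear later in this article (224 x 224 pixel values and 512 bounding boxes) and assumes a 16-pixel patch size, which is my understanding of the layoutlmv3-base default. It ignores any extra special tokens the image side may add.

```python
def count_image_patches(image_size=224, patch_size=16):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


text_positions = 512                   # matches the (512, 4) bbox shape
image_patches = count_image_patches()  # 14 patches per side
print(image_patches, text_positions + image_patches)
```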

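Some use cases for LayoutLMv3:

Document classification: sorting documents into categories such as invoices, receipts, contracts, and medical records.
Named entity recognition (NER): identifying and extracting named entities such as people, places, organizations, and products from documents.
Question answering: answering questions about a document's content, such as "What is the total on this invoice?" or "What is the patient's diagnosis in this medical record?"
Document visual question answering: answering questions about a document's visual layout, such as "Where is the customer's signature on this contract?" or "What is the price of this product in this catalog?"
Form understanding: extracting information from forms such as loan applications, insurance claims, and tax returns.
Receipt processing: extracting information from receipts, such as the purchase date and time, the items bought, and the total amount spent.
Invoice processing: extracting information from invoices, such as the vendor name, invoice number, due date, and line items.
Contract analysis: extracting key information from contracts, such as the parties involved, the contract terms, and the signatures.
Medical record analysis: extracting key information from medical records, such as patient demographics, medical history, and treatment plans.
E-discovery: identifying and extracting relevant documents from large volumes of electronic data.
Fraud detection: identifying fraudulent documents such as fake invoices and forged checks.
Customer support: answering customers' questions about their documents and accounts.
Research: studying documents such as news articles, scientific papers, and historical records.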

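LayoutLMv3 can be used across many industries, including finance, healthcare, legal, insurance, and retail. It can help businesses automate document processing, improve customer service, and reduce fraud risk.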

Annotation Tools

Piaf: free. Link: https://github.com/etalab/piaf
PDF24 Tools: free. Link: https://tools.pdf24.org/en/annotate-pdf
Label Studio: free trial. Link: https://labelstud.io/guide/get_started.html
Tagtog Annotation Tool: for business. Link: https://docs.tagtog.com/pdf-annotation-tool.html
UbiAi: for business. Link: https://ubiai.tools/ . A tutorial for this tool can be found here: https://www.youtube.com/watch?v=r1aoFj974FU&ab_channel=KarndeepSingh
Markup Hero: for business. Link: https://markuphero.com/try/annotate-pdf.html

The fine-tuning workflow is covered later in this article.

3. Automated Form Filling

This section demonstrates how to use Selenium to open a web page and fill in its form programmatically.

from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

# Load the website
web = webdriver.Chrome()
web.get('https://secure.sonnet.ca/#/quoting/property/about_you?lang=en')

# Wait for the page to load before filling it out
time.sleep(5)

# Inputs field
#ADDRESS
Address_input = "50 Laughton Ave M6N 2W9" # KEY
Address_fill = web.find_element("xpath",'//*[@id="addressInput"]')
Address_fill.send_keys(Address_input)

# FIRST NAME
FirstName_input = "Kiel" # KEY
FirstName_fill = web.find_element("xpath",'//*[@id="firstName"]')
FirstName_fill.send_keys(FirstName_input)

# LAST NAME
LastName_input = "Dang" # KEY
LastName_fill = web.find_element("xpath",'//*[@id="lastName"]')
LastName_fill.send_keys(LastName_input)

# MONTH OF BIRTH
dropdown_month = web.find_element("id","month-0Button")
dropdown_month.click()
option = web.find_element("xpath", "//span[contains(text(), 'January')]") #KEY
option.click()

# DATE OF BIRTH
Date_input = "23" # KEY
Date_fill = web.find_element("xpath",'//*[@id="date-0"]')
Date_fill.send_keys(Date_input)

# YEAR OF BIRTH
Year_input = "1994" # KEY
Year_fill = web.find_element("xpath",'//*[@id="year-0"]')
Year_fill.send_keys(Year_input)

# Prevent auto closing the web after application finishes the script.
input("Press enter to close the browser")
web.quit()
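The form above takes the date of birth as three separate inputs: a month name, a day, and a year. A minimal sketch of how an extracted date might be split into those pieces is shown below. It assumes the extraction step has already normalized the date to ISO format (YYYY-MM-DD); the `split_date_of_birth` helper is hypothetical.

```python
from datetime import date

# Spelled out explicitly so the result does not depend on the system locale
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def split_date_of_birth(iso_date):
    """Split 'YYYY-MM-DD' into (month name, day, year) strings."""
    parsed = date.fromisoformat(iso_date)
    return MONTH_NAMES[parsed.month - 1], str(parsed.day), str(parsed.year)


month_text, day_text, year_text = split_date_of_birth("1994-01-23")
print(month_text, day_text, year_text)  # January 23 1994
```

The three strings map onto the month dropdown, the day field, and the year field respectively.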

4. Fine-Tuning

Building an annotated dataset

pip install docai-py

https://github.com/butlerlabs/docai

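Define the information to extract from the documents, for example:

First Name (Middle Initial is included here as well)
Last Name
Address
Date of Birth
Expiration Date
Driver's License Number

To annotate, make sure the correct field is selected, then drag a box around the text in the document. You can also click individual pieces of text if you prefer.

Convert the annotations to LayoutLM format: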

# Download annotations from Butler using docai library
from docai.annotations import AnnotationClient
from docai.generated.models import ModelTrainingDocumentStatus

# Get your API Key from Butler
API_KEY = "MY_API_KEY"

# Specify the id of the model that you annotated your documents in
MODEL_ID = "MY_MODEL_ID"

# Load annotations from Butler
butler_client = AnnotationClient(API_KEY)
annotations = butler_client.load_annotations(
    model_id=MODEL_ID, 
    load_all_pages=True,
    document_status=ModelTrainingDocumentStatus.LABELED
)
print("Loaded {} annotations".format(len(annotations.training_documents)))

For MY_API_KEY and MY_MODEL_ID, see:

https://docs.butlerlabs.ai/reference/authentication

https://docs.butlerlabs.ai/reference/finding-a-model-id

We need to convert the annotations into a format that is easier to use with the transformers library and LayoutLMv3.

from docai.annotations import normalize_ner_annotation_for_layoutlm

# Convert annotations into NER format so they can be used
# to train LayoutLMv3 with Hugging Face
annotations_as_ner = annotations.as_ner(as_iob2=True)

# Normalize NER annotations by 1000 to match LayoutLM expected bounding box format
annotations_as_ner = list(map(normalize_ner_annotation_for_layoutlm, annotations_as_ner))

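This converts the annotations into a generic NER format. The load_annotations function does not currently support multi-page documents; only the first page of a multi-page document is included. Once the annotations are loaded in NER format, we normalize the bounding boxes so that their coordinates fall between 0 and 1000.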
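To illustrate what the 0 to 1000 normalization means, here is a minimal sketch that scales a pixel-coordinate box by the page size. It is not the docai implementation, only an illustration of the idea; the `normalize_box` helper and the sample numbers are assumptions.

```python
def normalize_box(box, width, height):
    """Scale a pixel box [x0, y0, x1, y1] to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# A box on a page that is 500 pixels wide and 1000 pixels tall
print(normalize_box([50, 100, 250, 150], width=500, height=1000))
```

Normalizing this way makes boxes comparable across pages of different sizes. The unnormalize_box function in the inference section performs the inverse scaling.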

We load the annotations into a Hugging Face Dataset object:

# Create Hugging Face Dataset
from datasets import Dataset
dataset = Dataset.from_list(annotations_as_ner)
print(dataset)

Training on a custom dataset

# First, let's create a few helper variables for use below
from datasets.features import ClassLabel
from docai.annotations import get_ner_tags_for_model

model_ner_tags = get_ner_tags_for_model(annotations.model_details)
print(f'Tags: {model_ner_tags["tags"]}')

label_list = model_ner_tags["tags"]
class_label = ClassLabel(names=label_list)
id2label = {k: v for k,v in enumerate(label_list)}
label2id = {v: k for k,v in enumerate(label_list)}
column_names = dataset.column_names

# Split dataset into train/test
dataset = dataset.train_test_split(test_size=0.1)
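The tags in label_list follow the IOB2 scheme requested earlier with as_iob2=True: the first word of an entity is tagged B-, the following words I-, and everything else O. The sketch below builds such tags for a toy example and derives the same kind of lookup table as label2id. The field names and the `to_iob2` helper are hypothetical; the real tag names come from the annotated model.

```python
def to_iob2(word_fields):
    """Convert per-word field names (or None) to IOB2 tags.

    The first word of each run of the same field gets B-, the rest I-;
    words with no field get O. Two adjacent entities of the same field
    would be merged by this simple sketch.
    """
    tags = []
    previous = None
    for field in word_fields:
        if field is None:
            tags.append("O")
        elif field == previous:
            tags.append("I-" + field)
        else:
            tags.append("B-" + field)
        previous = field
    return tags


words = ["Name", "Kiel", "Dang", "50", "Laughton", "Ave"]
fields = [None, "FIRST_NAME", "LAST_NAME", "ADDRESS", "ADDRESS", "ADDRESS"]
tags = to_iob2(fields)
print(list(zip(words, tags)))

# Build a tag-to-id lookup table from the tags that occur
tag_names = sorted(set(tags))
tag_to_id = {tag: index for index, tag in enumerate(tag_names)}
print(tag_to_id)
```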

Converting the dataset to LayoutLM format

First, load the layoutlmv3-base processor from the Hugging Face Hub:

# Load the microsoft/layoutlmv3-base processor from the Hugging Face hub
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

def convert_ner_tags_to_id(ner_tags):
  return [label2id[ner_tag] for ner_tag in ner_tags]

# This function is used to put the Dataset in its final format for training LayoutLM
def prepare_dataset(annotations):
    images = annotations['image']
    words = annotations['tokens']
    boxes = annotations['bboxes']
    # Map over labels and convert to numeric id for each ner_tag
    ner_tags = [convert_ner_tags_to_id(ner_tags) for ner_tags in annotations['ner_tags']]

    encoding = processor(images, words, boxes=boxes, word_labels=ner_tags, truncation=True, padding="max_length")

    return encoding

Then prepare the training and evaluation datasets:

from datasets import Features, Sequence, ClassLabel, Value, Array2D, Array3D

# Define features for use training the model 
features = Features({
    'pixel_values': Array3D(dtype="float32", shape=(3, 224, 224)),
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'attention_mask': Sequence(Value(dtype='int64')),
    'bbox': Array2D(dtype="int64", shape=(512, 4)),
    'labels': Sequence(feature=Value(dtype='int64')),
})

# Prepare our train & eval dataset
train_dataset = dataset["train"].map(
    prepare_dataset,
    batched=True,
    remove_columns=column_names,
    features=features,
)
eval_dataset = dataset["test"].map(
    prepare_dataset,
    batched=True,
    remove_columns=column_names,
    features=features,
)

Defining evaluation metrics

from docai.training import generate_layoutlm_compute_eval_metric_fn

# Use this utility from the docai SDK to create a function that can
# be used to calculate the evaluation metrics while training
compute_eval_metrics = generate_layoutlm_compute_eval_metric_fn(
    ner_labels=label_list,
    metric_name="seqeval",
    return_entity_level_metrics=False
)
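seqeval scores predictions at the entity level rather than per token: a predicted entity counts as correct only if its type and span both match the ground truth. The sketch below illustrates that idea in plain Python. It is not seqeval's implementation; stray I- tags without a preceding B- are simply ignored here, and the helper names are assumptions.

```python
def extract_entities(tags):
    """Return a set of (field, start, end) spans from IOB2 tags (end exclusive)."""
    entities = set()
    field, start = None, None
    # A trailing "O" sentinel closes an entity that runs to the end
    for index, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and field == tag[2:]
        if field is not None and not continues:
            entities.add((field, start, index))
            field, start = None, None
        if tag.startswith("B-"):
            field, start = tag[2:], index
    return entities


def entity_f1(true_tags, predicted_tags):
    """Precision, recall and F1 over exact-match entity spans."""
    true_entities = extract_entities(true_tags)
    predicted_entities = extract_entities(predicted_tags)
    correct = len(true_entities & predicted_entities)
    precision = correct / len(predicted_entities) if predicted_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = ["B-FIRST_NAME", "B-LAST_NAME", "O", "B-ADDRESS", "I-ADDRESS"]
pred = ["B-FIRST_NAME", "B-LAST_NAME", "O", "B-ADDRESS", "O"]
print(entity_f1(gold, pred))
```

In this toy case the address span is cut short, so only two of the three entities count as correct even though four of the five tokens match.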

Training the model

'''
Define our model, as well as the TrainingArguments which includes all the 
hyperparameters related to training.
'''

from transformers import TrainingArguments, Trainer
from transformers import LayoutLMv3ForTokenClassification
from transformers.data.data_collator import default_data_collator

MODEL_NAME = 'us_dl_model'

model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",
    id2label=id2label,
    label2id=label2id)

training_args = TrainingArguments(
    output_dir=MODEL_NAME,
    max_steps=1000,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    learning_rate=1e-5,
    evaluation_strategy="steps",
    eval_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="f1")

# Initialize our Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor,
    data_collator=default_data_collator,
    compute_metrics=compute_eval_metrics,
)

trainer.train()
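With max_steps fixed rather than a number of epochs, how many passes the model makes over the data depends on the dataset size. The sketch below works through that arithmetic for the settings above, assuming a single device and no gradient accumulation. The dataset size of 100 documents is a made-up figure, and `training_schedule` is a hypothetical helper.

```python
def training_schedule(num_train_examples, max_steps=1000, batch_size=2, eval_steps=100):
    """Summarize how many passes over the data and how many evaluations occur."""
    examples_seen = max_steps * batch_size
    epochs = examples_seen / num_train_examples
    evaluations = max_steps // eval_steps
    return {
        "examples_seen": examples_seen,
        "epochs": epochs,
        "evaluations": evaluations,
    }


print(training_schedule(num_train_examples=100))
```

For a small dataset this is many passes over the same documents, which is one reason to watch the evaluation metrics and keep the best checkpoint.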

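After training finishes, you should see metrics describing how the model performed on the eval dataset.

You can publish the model to the Hugging Face Hub for future use: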

# If you want, log in to the hugging face hub and push your
# model for future use
from huggingface_hub import notebook_login
notebook_login()
model.push_to_hub(repo_id='my/repo-id')

Running inference

'''
Run inference on the model using the processor and model from above
'''
from transformers import AutoModelForTokenClassification
import torch

example = dataset["test"][1]
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]
ner_tags = convert_ner_tags_to_id(example["ner_tags"])

encoding = processor(image, words, boxes=boxes, word_labels=ner_tags, return_tensors="pt")

if torch.cuda.is_available():
  encoding.to("cuda")
  model.to("cuda")

with torch.no_grad():
    outputs = model(**encoding)

logits = outputs.logits
predictions = logits.argmax(-1).squeeze().tolist()
labels = encoding.labels.squeeze().tolist()

def unnormalize_box(bbox, width, height):
     return [
         width * (bbox[0] / 1000),
         height * (bbox[1] / 1000),
         width * (bbox[2] / 1000),
         height * (bbox[3] / 1000),
     ]

token_boxes = encoding.bbox.squeeze().tolist()
width, height = image.size

true_predictions = [model.config.id2label[pred] for pred, label in zip(predictions, labels) if label != -100]
true_labels = [model.config.id2label[label] for prediction, label in zip(predictions, labels) if label != -100]
true_boxes = [unnormalize_box(box, width, height) for box, label in zip(token_boxes, labels) if label != -100]
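The label != -100 filter above drops positions that carry no real label. As I understand it, the processor assigns -100 to special tokens, padding, and (by default) the non-first subword pieces of a word, and -100 is also the index that PyTorch's cross-entropy loss ignores by default. The toy sketch below shows the effect of the filter; the ids and tag names are made up.

```python
IGNORE_INDEX = -100


def keep_labeled(predictions, labels, id_to_tag):
    """Drop positions whose label is IGNORE_INDEX and map ids to tag names."""
    kept_predictions, kept_labels = [], []
    for prediction, label in zip(predictions, labels):
        if label == IGNORE_INDEX:
            continue
        kept_predictions.append(id_to_tag[prediction])
        kept_labels.append(id_to_tag[label])
    return kept_predictions, kept_labels


id_to_tag = {0: "O", 1: "B-FIRST_NAME", 2: "B-LAST_NAME"}
# Toy sequence: start token, "Kiel", "Dang" split into two pieces, end token
labels = [-100, 1, 2, -100, -100]
predictions = [0, 1, 2, 2, 0]
print(keep_labeled(predictions, labels, id_to_tag))
```

Only the two positions with real labels survive, so predictions on special tokens and trailing subword pieces do not affect the comparison.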

Visualizing the predictions

'''
Some simple utilities for drawing bboxes on images of Driver's Licenses
'''
from PIL import ImageDraw, ImageFont

font = ImageFont.load_default()

def iob_to_label(label):
    label = label[2:]
    if not label:
        return 'other'
    return label

def draw_boxes_on_img(
    preds_or_labels, 
    boxes,
    draw,
    image, 
    unnormalize = False
):
  label_color_lookup = {
      "drivers_license_number": "green",
      "expiration_date": "blue",
      "date_of_birth": "red",
      "first_name": "orange",
      "last_name": "yellow",
      "address": "purple"
  }
  
  for pred_or_label, box in zip(preds_or_labels, boxes):
    label = iob_to_label(pred_or_label).lower()

    if label == 'other':
      continue
    else:
      if unnormalize:
        box = unnormalize_box(box, width, height)
      
      color = label_color_lookup[label]
      draw.rectangle(box, outline=color)
      draw.text((box[0] + 10, box[1] - 10), text=label, fill=color, font=font)

'''
Draw predictions
'''
image = example["image"]
image = image.convert("RGB")

draw = ImageDraw.Draw(image)

draw_boxes_on_img(true_predictions, true_boxes, draw, image)
image

And compare with the ground truth:

'''
Compare to ground truth
'''
image = example["image"]
image = image.convert("RGB")

draw = ImageDraw.Draw(image)

draw_boxes_on_img(example['ner_tags'], example['bboxes'], draw, image, True)

image

Comparing the two images shows how accurate our LayoutLM model is.

LayoutLM is a powerful multimodal model that you can apply to many different document AI tasks.



Original article. If you repost it, please credit the author: JiangYuan, URL: https://www.icnma.com
