Extracting Form Fields from a Multi-Page PDF with Amazon Textract and .NET

Want to learn more about AWS Lambda and .NET? Check out my A Cloud Guru course on ASP.NET Web API and Lambda.

Download full source code.

In my previous post, I showed how to extract key-value pairs from an image. The Textract client sent a request to the service, and the service returned the results promptly.

However, with a PDF, the file must be uploaded to S3. The request to the Textract service returns a job id. The application must then poll for the results.

Keep in mind this is a demo, and there are some limitations in the code - duplicated keys are removed, as are keys without values.

The pdf to process
The pdf to process

If you want to know more about the process for extracting key-value pairs from a form, see the Textract documentation.

Like the previous post, I use Blazor to load the form and display the extracted key-value pairs. My Blazor skills are limited, so don’t copy/paste the code.

The attached zip has the full source code, so I won’t go through it all here, instead, I’ll show only a few snippets.

Using Textract

Because the file is a PDF, I need to upload it to S3 before it can be processed. Pass the Textract client to the Razor page using dependency injection.

@inject IAmazonTextract TextractClient
@inject IAmazonS3 S3Client

Uploading the file to S3 requires you to have an S3 bucket in place already. See this blog post for more information on creating an S3 bucket.

PutObjectRequest putRequest = new PutObjectRequest
{
    BucketName = "textract-blog-posts", // you won't be able to use this bucket name
    Key = sourcePdf.Name,
    InputStream = sourcePdf.OpenReadStream(1024000),
    ContentType = sourcePdf.ContentType
};

PutObjectResponse response = await S3Client.PutObjectAsync(putRequest);

Then I send a request to Textract to process the file. The response contains a job id, which I use to poll for results. In this example, I am using a simple while loop, but if you are building a production application, you should use a more robust and scalable approach.

var startDocumentAnalysisRequest = new StartDocumentAnalysisRequest
{
    DocumentLocation = new DocumentLocation
    {
        S3Object = new Amazon.Textract.Model.S3Object
        {
            Bucket = "textract-blog-posts",
            Name = sourcePdf.Name
        }
    },
    FeatureTypes = new List<string> { "FORMS" }
};

var startDocumentAnalysisResponse = await textractClient.StartDocumentAnalysisAsync(startDocumentAnalysisRequest);

GetDocumentAnalysisResponse getDocumentAnalysisResponse;
while (true)
{
    getDocumentAnalysisResponse = await textractClient.GetDocumentAnalysisAsync(new GetDocumentAnalysisRequest
    {
        JobId = startDocumentAnalysisResponse.JobId
    });
    if(getDocumentAnalysisResponse.JobStatus != JobStatus.IN_PROGRESS)
    {
        break;
    }
    await Task.Delay(5000);
}

pagesKeysValues = getDocumentAnalysisResponse.GetKeyValuePairs(); 

The last line in the above code calls an extension method to extract the key-value pairs from the response, see the attached zip for the source code. The extension method is for demo purposes only. For a production application, you should read the Textract documentation and implement your own logic.

Output of Textract
Output of Textract

Download full source code.

comments powered by Disqus

Related