Using the Text Mining Tools on Federal Tax Forms

There has been a lot of excitement regarding the new Intelligence Suite released by Alteryx. I am very fortunate and thankful to my sales engineer - @MikeN helping me get the new Text Mining tool palette installed on my computer. Since, I can only say I’ve been addicted to building out use cases and a business case to buy the complete suite out in September 2020.

One of the biggest challenges I’ve faced as an Alteryx Artisan and user in my organization is telling teams throughout the organization that I cannot help them with their problems using Alteryx if most of their data is in PDF format. There were times where I’d suggest converting PDF documents to Excel, or utilize OCR technologies, but these solutions are either inefficient, inconsistent, or very expensive. The new Intelligence Suite Text Mining tool palette has changed that for me going forward.

Describe the business challenge or problem you needed to solve

I’m an accountant. I deal with PDF forms all the time. I send clients PDF forms, I file certain states with PDF forms, and I receive PDF forms from almost everyone.

In the example that spiked this use case, we were onboarding a new client that only had their prior year forms in PDF format. If you are a tax accountant and familiar with Thomson Reuters products and XML filings, you know there are certain ways of moving data within and between systems, however – this was not the case.

I was tasked with manually entering prior year information from the clients PDF federal tax returns into Excel workbooks – much of the information is carried forward on current year federal tax forms.

I was like uhhhh…manually? With my fingers? In Excel? Like Adobe? What?

Describe your working solution All I have to say now is Alteryx – Text Mining Tool Palettes.

Firstly, I used the Image Template tool to map out the annotations (or fields) in my PDF form that I wanted to extract information from.

Then, I navigate to a PDF form with the PDF Input tool also found in the new Text Mining tool palette.

I perform some simple data manipulation to make sure that the pages in my PDF document match with the correct PDF template built into an Image Template tool, and simply run it through the new Image to Text tool. The configuration is very simple if you haven’t used it before.

With further manipulation and Transform tools, I am able to transpose all of the data on the path and page of the data extracted from the PDF form.

I used a batch macro grouped by the actual fields (i.e. annotations) to cleanse all of the data. This is so that I can append a RecordID to each grouped set in instances where there may be a tabbed line of information containing information on multiple lines.

You can see how clean the data looks after it runs through the batch macro.

Also note the point I just referenced where for Line 02a – you actually get three records each with a different RecordID number. This way I know when there is a line 1, a line 2 and a line 3. Often times, this is a name and address.

Describe the benefits you have achieved

This use case and tool we built is going to save us time extracting data from prior year federal tax returns that we only have via PDF file sharing. We also eliminate human error in data entry.

We simply now mapped the Excel output that comes out of workflow to other third party applications, Excel workbooks and Alteryx workflows to continue the process efficiently and quick.

There is so much more to discuss with regards to confirming all of the information that you wanted to receive. I’ve simply built Text Input tools that contain all of the data points I want to extract, and simply use a Find Replace tool to append the extracted data to my template – then I can simply just review what was extracted and what was empty on the form.


Why this over OCR?

Well simply OCR is expensive. It takes a lot of time to map documents to their proper data points, and it requires users to confirm the data extraction in order to assist with the machine learning side of the technology. With the annotations feature in the Image Template tool, I don’t have to confirm data extraction – I know that the tool will always refer to the exact same location.

Where is there to grow?

There are a lot of things I am still unsure of with the new Text Mining tool palette. For example, sometimes when I map out an entire PDF form, the tool actually bugs out and I lose my annotations which I spend a lot of time creating. As well, there are instances where the actual PDF shared with me is printed in a different format or size which also causes error in my data extraction. This is something that proves OCR more consistent and a better investment if this scenario will present itself in many cases. However, for the current investment we make in Alteryx, and the tools that we have at hand – this has become an amazing feature and addition to my data skills library.

If you have any questions on anything discussed in this use case, please feel free to reach out here, Via LinkedIn, or through my Instagram (if you hover over my Profile icon).

I am so happy to share this with the Alteryx community and I can’t wait to see what others build with the text mining tools.

Happy Pride – Happy Summer – Happy Health – Happy Unity.