Testing ChatGPT's Advanced Data Analysis Features

ChatGPT's Code Interpreter has been renamed to Advanced Data Analysis. This post tests its features.

On August 28, 2023, OpenAI announced ChatGPT Enterprise, a version designed specifically for businesses. According to OpenAI's blog post, it offers enterprise-grade security and privacy, unlimited high-speed GPT-4 access, longer context windows for processing longer inputs, and more.

I decided to test how well ChatGPT's Advanced Data Analysis can read and extract structured data from a PDF file. Spoiler: After some trial and error, it worked!

Note: This is NOT a practical guide on how to perform OCR or PDF parsing using ChatGPT. However, we'll learn a lot about ChatGPT's Advanced Data Analysis: what it can do, how it works, and which libraries it uses.

For a detailed guide to parsing PDF files, check out: A Step-by-Step Guide to Extracting Data from PDFs with ChatGPT.

Let's Get Started!

I used a sample invoice for my testing: a text-based (searchable) PDF whose content can be selected and copied to the clipboard.

I crafted a simple prompt and attached the PDF.

Read the attached PDF and extract structured data in JSON format from this invoice:
* invoice_id
* company_name
* invoice_date
* customer_fullname
* total_amount
* line_items: Array of objects with the following structure: 'qty', 'description', 'unit_price', 'total'

ChatGPT segmented the task into three logical steps, using Python and the PyPDF2 library.

Oops! Text extraction failed. The error output pointed to a deprecated function and recommended using Page.extract_text instead. ChatGPT had a different plan: it decided to turn the PDF into an image and use OCR for text extraction.

This is getting interesting. Let’s proceed and take a look at the converted image.

Impressively, ChatGPT uses the pdf2image library to convert the PDF to an image and Matplotlib to add a title: "First Page of Invoice PDF". Here, we also discover that ChatGPT can output images and even offer a download link.

Let's move on to extract the raw invoice data from this image.

ChatGPT opts for pytesseract for OCR, a commonly used choice. With a single line of code, it transforms our image into raw text, then applies regular expressions to extract the fields specified in our initial prompt.
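
The "try the text layer first, fall back to OCR" flow described so far can be sketched roughly as follows. This is not ChatGPT's actual code. The function names and the "invoice.pdf" path are assumptions for illustration, and the two extractor functions need PyPDF2, pdf2image, and pytesseract installed (they are imported lazily, so the rest of the sketch loads without them).

```python
from typing import Callable, Iterable, Optional


def extract_with_pypdf2(path: str) -> str:
    """Read the PDF's embedded text layer (requires the PyPDF2 package)."""
    from PyPDF2 import PdfReader  # lazy import: only needed when this runs

    reader = PdfReader(path)
    # extract_text() is the current method name; extractText() is deprecated.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_with_ocr(path: str) -> str:
    """Render each page to an image and OCR it (requires pdf2image, pytesseract)."""
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(path)  # list of PIL images, one per page
    return "\n".join(pytesseract.image_to_string(image) for image in pages)


def first_usable_text(
    path: str, extractors: Iterable[Callable[[str], str]]
) -> Optional[str]:
    """Return the first non-blank result, skipping extractors that raise."""
    for extract in extractors:
        try:
            text = extract(path)
        except Exception as exc:  # broad on purpose: any failure means "try next"
            print(f"{extract.__name__} failed: {exc}")
            continue
        if text and text.strip():
            return text
    return None
```

A call such as first_usable_text("invoice.pdf", [extract_with_pypdf2, extract_with_ocr]) would only reach the OCR step when the text layer cannot be read or comes back blank.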

The extraction successfully captured fields like invoice_id, company_name, and invoice_date, but failed to obtain others.

{
    "invoice_id": "INV2023/115",
    "company_name": "Digital Growth Agency",
    "invoice_date": "01/02/2023",
    "customer_fullname": "SHIP TO:",
    "total_amount": null,
    "line_items": []
}
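
To make the regular-expression step concrete, here is a minimal sketch using Python's re module. The patterns are my own guesses written against the OCR text in this post, not the ones ChatGPT generated. The deliberately naive customer pattern shows one plausible way a value like "SHIP TO:" can end up in customer_fullname.

```python
import re
from typing import Optional

# Hypothetical patterns, written against the OCR text quoted in this article.
INVOICE_ID = re.compile(r"INVOICE\s*#\s*(\S+)")
INVOICE_DATE = re.compile(r"DATE:\s*(\d{2}/\d{2}/\d{4})")
# Naive: grabs whatever follows the first "TO:" on the same line.
NAIVE_CUSTOMER = re.compile(r"TO:\s*(.+)")


def first_group(pattern: re.Pattern, text: str) -> Optional[str]:
    """Return the first capture group, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


sample = (
    "INVOICE #INV2023/115\n"
    "DATE: 01/02/2023\n"
    "\n"
    "TO: SHIP TO:\n"
    "Patrick Hill Patrick Hill\n"
)

header = {
    "invoice_id": first_group(INVOICE_ID, sample),
    "invoice_date": first_group(INVOICE_DATE, sample),
    "customer_fullname": first_group(NAIVE_CUSTOMER, sample),
}
print(header)
```

Because OCR flattens the two address columns into one line, the text right after "TO:" is the neighbouring column's label rather than the customer's name.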

Upon closer examination, it's clear why certain fields were missing: the plain text simply did not contain them.

Digital Growth Agency INVOICE

Elevate your online game with us.

1138 Arron Smith Drive
Stockton, California, 95219
Phone: 555-555-5555 Fax: 555-555-5555

INVOICE #INV2023/115
DATE: 01/02/2023

TO: SHIP TO:
Patrick Hill Patrick Hill
3680 Willow Greene Drive 3680 Willow Greene Drive
Montgomery, Alabama, 36109 Montgomery, Alabama, 36109
Phone: 333-111-4321 Phone: 333-111-4321
P.O. NUMBER REQUISITIONER SHIPPED VIA | F.O.B. POINT TERMS
PO12355901 DHL Due on
receipt
QTY DESCRIPTION UNIT PRICE TOTAL

A review of the code shows that ChatGPT truncated the text to the first 500 characters, which cut it off right after the line-item table header. A second OCR attempt was needed.
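
A cheap guard against this kind of silent truncation is to check that the text still contains the markers the parser depends on. The sketch below is an illustration only: the marker list and the padded stand-in text are assumptions, not taken from ChatGPT's code.

```python
from typing import List, Sequence

# Assumed markers: strings the parser needs to find before it can succeed.
REQUIRED_MARKERS = ("INVOICE #", "TOTAL DUE")


def missing_markers(text: str, markers: Sequence[str] = REQUIRED_MARKERS) -> List[str]:
    """List the markers that do not appear anywhere in the text."""
    return [marker for marker in markers if marker not in text]


# Stand-in text: padding pushes the totals past the 500-character mark.
full_text = "INVOICE #INV2023/115\n" + "x" * 480 + "\nTOTAL DUE $1894.00\n"
preview = full_text[:500]  # the kind of slice meant for display, not parsing

print(missing_markers(full_text))
print(missing_markers(preview))
```

Parsing the full string and slicing only what gets printed avoids the problem altogether.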

Almost there!

During this test, ChatGPT unexpectedly lost the session, requiring me to re-upload the original PDF. After converting it once more into an image and running pytesseract, the raw text appeared as follows:

Digital Growth Agency INVOICE

Elevate your online game with us.

1138 Arron Smith Drive
Stockton, California, 95219
Phone: 555-555-5555 Fax: 555-555-5555

INVOICE #INV2023/115
DATE: 01/02/2023

TO: SHIP TO:
Patrick Hill Patrick Hill
3680 Willow Greene Drive 3680 Willow Greene Drive
Montgomery, Alabama, 36109 Montgomery, Alabama, 36109
Phone: 333-111-4321 Phone: 333-111-4321
P.O. NUMBER REQUISITIONER SHIPPED VIA | F.O.B. POINT TERMS
PO12355901 DHL Due on
receipt
QTY DESCRIPTION UNIT PRICE TOTAL
1 Website redesign $900.00 $900.00
2 Blog article $50.00 $100.00
50 Business cards x100 $10.90 $545.00

SUBTOTAL $1545.00
SALES TAX $309.00
SHIPPING & HANDLING $40.00
TOTAL DUE $1894.00

Make all checks payable to Digital Growth Agency
If you have any questions concerning this invoice, contact Peter Smith, 555-555-5555, [email protected]

THANK YOU FOR YOUR BUSINESS!
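
With the complete text available, the line items and the total can be pulled out with row-level patterns. This is a rough sketch that assumes every item sits on one line in the form "qty description $unit_price $total", as in the text above. The patterns and function names are mine, not ChatGPT's.

```python
import re
from typing import Dict, List, Optional

# Assumed row shape: "<qty> <description> $<unit_price> $<total>" on one line.
LINE_ITEM = re.compile(
    r"^(?P<qty>\d+)[ \t]+(?P<description>.+?)[ \t]+"
    r"\$(?P<unit_price>[\d,]+\.\d{2})[ \t]+\$(?P<total>[\d,]+\.\d{2})[ \t]*$",
    re.MULTILINE,
)
TOTAL_DUE = re.compile(r"^TOTAL DUE[ \t]+\$([\d,]+\.\d{2})", re.MULTILINE)


def parse_line_items(text: str) -> List[Dict[str, str]]:
    """Return one dict per matching row, keeping the values as strings."""
    return [match.groupdict() for match in LINE_ITEM.finditer(text)]


def parse_total_due(text: str) -> Optional[str]:
    """Return the amount after 'TOTAL DUE', or None if the line is absent."""
    match = TOTAL_DUE.search(text)
    return match.group(1) if match else None


ocr_tail = (
    "QTY DESCRIPTION UNIT PRICE TOTAL\n"
    "1 Website redesign $900.00 $900.00\n"
    "2 Blog article $50.00 $100.00\n"
    "50 Business cards x100 $10.90 $545.00\n"
    "\n"
    "SUBTOTAL $1545.00\n"
    "TOTAL DUE $1894.00\n"
)
items = parse_line_items(ocr_tail)
total_due = parse_total_due(ocr_tail)
```

Anchoring TOTAL_DUE to the start of a line keeps it from matching inside "SUBTOTAL", and the dollar signs in the row pattern keep address lines that begin with a number from being mistaken for items.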

The updated JSON was significantly improved but still included inaccuracies in the company_name and customer_fullname fields.

{
    "invoice_id": "INV2023/115",
    "company_name": "Digital Growth Agency INVOICE",
    "invoice_date": "01/02/2023",
    "customer_fullname": "SHIP TO:",
    "total_amount": "1894.00",
    "line_items": [
        {
            "qty": "1",
            "description": "Website redesign",
            "unit_price": "900.00",
            "total": "900.00"
        },
        {
            "qty": "2",
            "description": "Blog article",
            "unit_price": "50.00",
            "total": "100.00"
        },
        {
            "qty": "50",
            "description": "Business cards x100",
            "unit_price": "10.90",
            "total": "545.00"
        }
    ]
}
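
Both remaining errors trace back to how OCR flattens the page layout: the "INVOICE" banner shares a line with the company name, and the TO and SHIP TO columns are merged into single lines. One possible cleanup is sketched below. It assumes the bill-to and ship-to names are identical, which holds for this invoice but not in general, and the helper names are hypothetical.

```python
import re
from typing import Optional


def clean_company_name(first_line: str) -> str:
    """Drop a trailing 'INVOICE' banner that OCR merged into the company line."""
    return re.sub(r"\s+INVOICE\s*$", "", first_line.strip())


def customer_from_columns(text: str) -> Optional[str]:
    """Take the line under 'TO: SHIP TO:' and keep the left-hand column.

    Assumes both columns hold the same name, so the merged line is that name
    twice; if the halves differ, the whole line is returned unchanged.
    """
    match = re.search(r"^TO:[ \t]+SHIP TO:[ \t]*\n(.+)$", text, re.MULTILINE)
    if not match:
        return None
    words = match.group(1).split()
    half = len(words) // 2
    if len(words) % 2 == 0 and words[:half] == words[half:]:
        return " ".join(words[:half])
    return " ".join(words)


print(clean_company_name("Digital Growth Agency INVOICE"))
print(customer_from_columns("TO: SHIP TO:\nPatrick Hill Patrick Hill\n"))
```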

After checking, ChatGPT found the errors and suggested fixes. After a few more tries, it got everything right!

{
    "invoice_id": "INV2023/115",
    "company_name": "Digital Growth Agency",
    "invoice_date": "01/02/2023",
    "customer_fullname": "Patrick Hill",
    "total_amount": "1894.00",
    ...
}
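
The checking step can also be expressed as a few explicit rules. The sketch below is one way to do it, not a reconstruction of what ChatGPT ran: it flags label text that leaked into values, verifies qty * unit_price against each row total, and confirms that total_amount is not smaller than the sum of the rows (it may be larger, since it also covers tax and shipping). The sample input is the second JSON result from earlier in this post.

```python
from decimal import Decimal
from typing import Any, Dict, List


def validate_invoice(invoice: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means no check failed."""
    problems: List[str] = []

    # Label text leaking into a value suggests the regex grabbed the wrong span.
    if invoice["company_name"].upper().endswith("INVOICE"):
        problems.append("company_name ends with the 'INVOICE' banner")
    if invoice["customer_fullname"].rstrip().endswith(":"):
        problems.append("customer_fullname looks like a field label")

    # Each row should satisfy qty * unit_price == total.
    line_sum = Decimal("0")
    for index, item in enumerate(invoice["line_items"], start=1):
        total = Decimal(item["total"])
        if Decimal(item["qty"]) * Decimal(item["unit_price"]) != total:
            problems.append(f"line {index}: qty * unit_price != total")
        line_sum += total

    # total_amount includes tax and shipping, so it may exceed the row sum
    # but should never be smaller than it.
    amount = invoice["total_amount"]
    if amount is not None and Decimal(amount) < line_sum:
        problems.append("total_amount is smaller than the sum of line items")
    return problems


second_attempt = {
    "invoice_id": "INV2023/115",
    "company_name": "Digital Growth Agency INVOICE",
    "invoice_date": "01/02/2023",
    "customer_fullname": "SHIP TO:",
    "total_amount": "1894.00",
    "line_items": [
        {"qty": "1", "description": "Website redesign", "unit_price": "900.00", "total": "900.00"},
        {"qty": "2", "description": "Blog article", "unit_price": "50.00", "total": "100.00"},
        {"qty": "50", "description": "Business cards x100", "unit_price": "10.90", "total": "545.00"},
    ],
}
for problem in validate_invoice(second_attempt):
    print(problem)
```

Decimal is used instead of float so that amounts such as 10.90 are compared exactly.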

Conclusion

That was fun! We learned that ChatGPT's Advanced Data Analysis can:

  • Generate, debug, and execute code.
  • Extract text from PDFs.
  • Convert PDFs to images and output them.
  • Generate files and provide download links.
  • Use OCR to convert images into raw text.
  • Create, debug, and refine regular expressions.
  • Validate extracted data and identify errors.