Files Module: Intelligent File Handling
The intellibricks.files module is designed to provide a robust and Pythonic way to handle files within your AI applications. It represents files as RawFile objects and provides functionality for parsing and extracting content from various file types.
Core Concepts of the Files Module
- RawFile Abstraction: The central class is RawFile, which represents a file in a structured manner. It encapsulates:
  - contents: The raw byte content of the file.
  - name: The name of the file (without path).
  - extension: The file extension (e.g., “pdf”, “docx”, “txt”).
- File Loading: RawFile provides convenient class methods for loading files from:
  - File paths (RawFile.from_file_path)
  - Bytes data (RawFile.from_bytes)
  - In-memory file-like objects (RawFile.from_file_obj)
- File Saving: You can save the contents of a RawFile to disk using RawFile.to_file_path.
- File Parsing Infrastructure: The module integrates a set of file parsers that extract structured information (text, images, tables) from different file types.
- File Extension Management: The module helps in managing and determining file extensions, which is crucial for routing files to appropriate parsers.
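To make the abstraction concrete, here is a minimal, self-contained sketch of a class with the same three fields and the same four method names listed above. SketchRawFile is a hypothetical stand-in written for illustration only. It is not the library's implementation, and details such as lowercasing the extension or stripping directory components from the name are assumptions.

```python
import io
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import BinaryIO


@dataclass(frozen=True)
class SketchRawFile:
    """Hypothetical stand-in for RawFile; not the library's implementation."""

    contents: bytes  # raw byte content of the file
    name: str        # file name without any directory part
    extension: str   # extension without the dot, lowercased (assumption)

    @classmethod
    def from_bytes(cls, data: bytes, name: str) -> "SketchRawFile":
        # Keep only the final path component, then derive the extension from it.
        base = Path(name).name
        ext = Path(base).suffix.lstrip(".").lower()
        return cls(contents=data, name=base, extension=ext)

    @classmethod
    def from_file_obj(cls, file_obj: BinaryIO, name: str) -> "SketchRawFile":
        # Read everything the file-like object has left to give.
        return cls.from_bytes(file_obj.read(), name)

    @classmethod
    def from_file_path(cls, path: str) -> "SketchRawFile":
        p = Path(path)
        return cls.from_bytes(p.read_bytes(), p.name)

    def to_file_path(self, path: str) -> None:
        # Write the stored bytes back out unchanged.
        Path(path).write_bytes(self.contents)


# Round trip: in-memory object -> disk -> loaded again from the path.
with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "copy.txt")
    original = SketchRawFile.from_file_obj(io.BytesIO(b"hello"), "notes/sample.TXT")
    original.to_file_path(target)
    reloaded = SketchRawFile.from_file_path(target)

print(original.name, original.extension)
print(reloaded.contents == original.contents)
```

Keeping the object immutable (frozen=True) is a design choice of this sketch: a file's bytes, name, and extension travel together and cannot drift apart after loading.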
Working with RawFile Objects
Let’s explore how to create and use RawFile objects.
Creating RawFile from a File Path
Assume you have a file named document.pdf in your project directory. You can create a RawFile object like this:
from intellibricks.files import RawFile
file_path = "document.pdf" # Or any path to your file
raw_file = RawFile.from_file_path(file_path)
print(f"File Name: {raw_file.name}")
print(f"File Extension: {raw_file.extension}")
# raw_file.contents now holds the raw bytes of the PDF file
Creating RawFile from Bytes Data
If you have file content in bytes format (e.g., read from a network stream or generated programmatically), you can create a RawFile using from_bytes:
file_bytes = b"%PDF-1.5... (PDF file content bytes) ..." # Example PDF bytes
raw_file_from_bytes = RawFile.from_bytes(file_bytes, "report.pdf")
print(f"File Name: {raw_file_from_bytes.name}") # Output: report.pdf
print(f"File Extension: {raw_file_from_bytes.extension}") # Output: pdf
# raw_file_from_bytes.contents holds the provided bytes
Creating RawFile from a File-Like Object
You can also create a RawFile from an in-memory file-like object (e.g., from io.BytesIO or when you receive a file object from a web request):
import io
# Simulate an in-memory file object
file_content_str = "This is the content of my text file."
file_obj = io.BytesIO(file_content_str.encode('utf-8'))
raw_file_from_obj = RawFile.from_file_obj(file_obj, "sample.txt")
print(f"File Name: {raw_file_from_obj.name}") # Output: sample.txt
print(f"File Extension: {raw_file_from_obj.extension}") # Output: txt
# raw_file_from_obj.contents holds the bytes read from file_obj
Saving RawFile Contents to Disk
To save the content of a RawFile to a new file path:
output_path = "output_documents/saved_document.pdf" # Define where to save
raw_file.to_file_path(output_path)
print(f"File saved to: {output_path}")
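The example above writes into an output_documents directory. This page does not say whether to_file_path creates missing parent directories, so a defensive sketch can create them first with the standard library. The helper name ensure_parent_dir is hypothetical, and a plain byte write stands in for the to_file_path call so the snippet runs without IntelliBricks installed.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(output_path: str) -> Path:
    """Create the parent directory of output_path if needed and return the path."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(f"{tmp}/output_documents/saved_document.pdf")
    # With IntelliBricks, raw_file.to_file_path(str(target)) would go here.
    # A plain write stands in for it so the sketch runs on its own.
    target.write_bytes(b"%PDF-1.5 placeholder bytes")
    parent_created = target.parent.is_dir()
    saved_ok = target.exists()

print(parent_created, saved_ok)
```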
Using File Parsers
IntelliBricks provides a set of file parsers within the intellibricks.files.parsers module to extract structured content from different file formats. Here’s how you can use them:
Available Parsers
Currently, IntelliBricks offers the following file parsers:
- TxtFileParser: For parsing plain text files (.txt).
- PdfFileParser: For parsing PDF documents (.pdf).
You can import these parsers from intellibricks.files.parsers.
Basic Usage Example
Let’s demonstrate how to parse a PDF file and a plain text file to extract their content:
from intellibricks.files import RawFile
from intellibricks.files.parsers import PdfFileParser, TxtFileParser
# 1. Load a RawFile (e.g., from a file path)
raw_pdf_file = RawFile.from_file_path("document.pdf") # Replace with your PDF file
raw_txt_file = RawFile.from_file_path("document.txt") # Replace with your TXT file
# 2. Instantiate the appropriate parser
pdf_parser = PdfFileParser()
txt_parser = TxtFileParser()
# 3. Extract content using the parser
parsed_pdf_document = pdf_parser.parse(raw_pdf_file)
parsed_txt_document = txt_parser.parse(raw_txt_file)
# 4. Access parsed content (ParsedFile object)
print(f"Parsed document name: {parsed_pdf_document.name}")
for section in parsed_pdf_document.sections:
    print(f"\nSection {section.number}:")
    print(f" Text (first 100 chars): {section.text[:100]}...")
    # ... access other parsed content like images, items, etc.
print(f"Parsed document name: {parsed_txt_document.name}")
for section in parsed_txt_document.sections:
    print(f"\nSection {section.number}:")
    print(f" Text (first 100 chars): {section.text[:100]}...")
Handling Different File Types
To parse different file types, you would:
1. Create a RawFile object for your file.
2. Instantiate the appropriate parser class based on the file type (e.g., PdfFileParser for PDFs, DocxFileParser for DOCX files, when available).
3. Call the parse() method of the parser, passing the RawFile object as input.
4. Work with the returned ParsedFile object to access the extracted structured content.
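The steps above amount to choosing a parser from the file's extension. One way to sketch that routing is a small registry keyed by extension. Everything below is a stand-in: StubFile, StubTxtParser, StubPdfParser, PARSERS, and parse_by_extension are hypothetical names, and the stub parsers return plain strings rather than ParsedFile objects. In real code the registry values would be the parser classes from intellibricks.files.parsers.

```python
from dataclasses import dataclass


@dataclass
class StubFile:
    """Stand-in for RawFile with only the fields this sketch needs."""

    name: str
    extension: str
    contents: bytes


class StubTxtParser:
    def parse(self, file: StubFile) -> str:
        # Plain text: decode the bytes directly.
        return file.contents.decode("utf-8")


class StubPdfParser:
    def parse(self, file: StubFile) -> str:
        # A real PDF parser would extract pages; this only reports the size.
        return f"<{len(file.contents)} bytes of PDF data>"


# Registry mapping an extension (no dot, lowercase) to a parser class.
PARSERS = {"txt": StubTxtParser, "pdf": StubPdfParser}


def parse_by_extension(file: StubFile) -> str:
    parser_cls = PARSERS.get(file.extension.lower())
    if parser_cls is None:
        raise ValueError(f"No parser registered for .{file.extension} files")
    return parser_cls().parse(file)


print(parse_by_extension(StubFile("sample.txt", "txt", b"hello")))
print(parse_by_extension(StubFile("report.pdf", "PDF", b"%PDF-1.5")))
```

Failing loudly on an unregistered extension keeps unsupported formats from being silently skipped; adding a new format is then a one-line change to the registry.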
Parsed File Structure
The output of file parsing is a ParsedFile object, which contains:
- Sections: The document is divided into SectionContent objects, representing pages or logical sections. Each SectionContent contains:
  - Text: The extracted text content of the section.
  - Markdown: A Markdown representation of the section content, including headings, lists, and basic formatting.
  - Images: A list of Image objects found in the section, including image data and metadata.
  - Items: A list of structured PageItem objects, representing elements like paragraphs, headings, and tables.
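The shape described above can be pictured with two small dataclasses. This is an illustrative sketch, not the library's class definitions: SketchParsedFile and SketchSection are hypothetical names, the attributes name, sections, number, text, images, and items mirror those used in the examples on this page, and the attribute name markdown is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class SketchSection:
    """Hypothetical stand-in for SectionContent."""

    number: int
    text: str
    markdown: str = ""  # attribute name is an assumption
    images: List[Any] = field(default_factory=list)
    items: List[Any] = field(default_factory=list)


@dataclass
class SketchParsedFile:
    """Hypothetical stand-in for ParsedFile."""

    name: str
    sections: List[SketchSection] = field(default_factory=list)


def summarize(parsed: SketchParsedFile) -> List[str]:
    """Return one summary line per section."""
    return [
        f"Section {s.number}: {len(s.text)} chars, "
        f"{len(s.images)} images, {len(s.items)} items"
        for s in parsed.sections
    ]


doc = SketchParsedFile(
    name="document.pdf",
    sections=[
        SketchSection(number=1, text="Hello", markdown="# Hello", items=["heading"]),
        SketchSection(number=2, text="More text"),
    ],
)
for line in summarize(doc):
    print(line)
```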
Example of Parsed Content (Illustrative)
from intellibricks.files import RawFile
from intellibricks.files.parsers import PdfFileParser
# Assume you have a RawFile object loaded from a PDF
raw_pdf_file = RawFile.from_file_path("document.pdf")
pdf_parser = PdfFileParser()
parsed_document = pdf_parser.parse(raw_pdf_file)
print(f"Parsed document: {parsed_document.name}")
for section in parsed_document.sections:
    print(f"\nSection {section.number}:")
    print(f" Text (first 100 chars): {section.text[:100]}...")
    if section.images:
        print(f" Images found: {len(section.images)}")
    if section.items:
        print(f" Items found: {len(section.items)}")
        # Iterate through items (TextPageItem, HeadingPageItem, TablePageItem, etc.)