Python Script to Read and Display Word Document Content
Chapter 1: Introduction to Document Reading
In this section, we explore a Python script that reads a specified Word document and prints its paragraph count along with the text of each paragraph. This is useful for tasks such as document statistics and text analysis.
The following code snippet uses the python-docx package, which is imported as docx. Calling docx.Document("script.docx") loads the file into memory as a Document object. The script prints the total number of paragraphs, obtained with len(doc.paragraphs), and then iterates over the paragraphs with enumerate, using start=1 so that numbering begins at 1 rather than 0. Each paragraph's number and text are printed with an f-string.
import docx
doc = docx.Document("script.docx")
print(f"Number of paragraphs: {len(doc.paragraphs)}")
for count, para in enumerate(doc.paragraphs, start=1):
    print(f"{count}: {para.text}")
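Since len(doc.paragraphs) also counts empty paragraphs, a document-statistics task may want to skip them. The following is a minimal sketch of that idea, not part of the original script: the helper names paragraph_stats and stats_for_file are hypothetical, and the only assumption about each paragraph is that it exposes a .text attribute, as python-docx paragraphs do.

```python
def paragraph_stats(paragraphs):
    """Return (non_empty_paragraph_count, total_word_count).

    `paragraphs` is any iterable of objects with a `.text` attribute,
    such as the list returned by `doc.paragraphs`.
    """
    # Keep only paragraphs that contain something other than whitespace.
    non_empty = [p.text for p in paragraphs if p.text.strip()]
    # Count whitespace-separated words across the remaining paragraphs.
    total_words = sum(len(text.split()) for text in non_empty)
    return len(non_empty), total_words


def stats_for_file(path):
    """Hypothetical convenience wrapper: compute the stats for a .docx file."""
    # Imported here so paragraph_stats can be used without python-docx installed.
    import docx

    return paragraph_stats(docx.Document(path).paragraphs)
```

Calling stats_for_file("script.docx") would then return the two counts for the same file used above.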
Chapter 2: Retrieving and Modifying Paragraph Content
The next code snippet opens a Word document (.docx) with the python-docx package, reports how many runs selected paragraphs contain, and prints the text of each run. A run is a contiguous span of text within a paragraph that shares the same character formatting. The document is then saved to a specified file path; because the script changes nothing, the saved file is an unmodified copy, and any edits to runs would be made before this step.
The code starts by opening the document at the defined path using docx.Document(doc_path). It then counts the paragraphs with len(doc.paragraphs) and prints the result.
Next, it loops over selected paragraph indices (here 0 and 2, that is, the first and third paragraphs) and retrieves each paragraph object with doc.paragraphs[p_idx]. The document must contain at least three paragraphs; otherwise doc.paragraphs[2] raises an IndexError. The script counts the runs in the paragraph with len(para.runs) and prints the result.
An inner loop then iterates over the runs in the paragraph and prints each run's text via run.text. Finally, the document is saved under a new name with doc.save("Modified_Document.docx").
import docx
doc_path = "script.docx"
doc = docx.Document(doc_path)
print("Number of paragraphs: ", len(doc.paragraphs))
for p_idx in [0, 2]:
    para = doc.paragraphs[p_idx]
    print("Number of runs: ", len(para.runs))
    for run in para.runs:
        print(run.text)
doc.save("Modified_Document.docx") # Specify the path for saving
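The script above saves the document without changing it. As a hedged illustration of what a modification step might look like, the sketch below replaces a substring inside each run before saving. The function names replace_in_runs and replace_and_save are hypothetical; the sketch relies only on run.text being readable and assignable, which python-docx supports. Text that is split across two runs will not be matched by this simple approach.

```python
def replace_in_runs(paragraph, old, new):
    """Replace `old` with `new` inside each run of `paragraph`.

    Returns the number of runs whose text was changed. Matches that
    span a run boundary are not detected.
    """
    changed = 0
    for run in paragraph.runs:
        if old in run.text:
            # Assigning to run.text keeps the run's character formatting.
            run.text = run.text.replace(old, new)
            changed += 1
    return changed


def replace_and_save(src_path, dst_path, old, new):
    """Hypothetical wrapper: apply the replacement to a file and save a copy."""
    # Imported here so replace_in_runs can be used without python-docx installed.
    import docx

    doc = docx.Document(src_path)
    total = sum(replace_in_runs(para, old, new) for para in doc.paragraphs)
    doc.save(dst_path)
    return total
```

Saving to a different path than the source, as the original script does, leaves the input file untouched if the result needs to be checked first.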
Chapter 3: Extracting Heading Paragraphs
The following code reads a specified Word document and prints its heading paragraphs, that is, those whose style name starts with "Heading". It uses the python-docx package, which is imported as docx.
- Importing the `docx` Module: The import docx statement gives access to the functionality provided by the python-docx library.
- Opening the Word Document: A doc object is created by invoking the Document class to open the specified Word document ("script.docx"). This enables reading and processing the document's content.
- Counting Paragraphs: The total number of paragraphs in the document is calculated using the len function and printed.
- Iterating Through Paragraphs: A loop iterates through each paragraph in the document, checking if the style name begins with "Heading". If true, the text of that paragraph is printed.
In summary, this code reads the total number of paragraphs in a specified Word document and prints all heading paragraphs.
import docx
doc = docx.Document("script.docx")
print("Paragraphs: ", len(doc.paragraphs))
for para in doc.paragraphs:
    if para.style.name.startswith('Heading'):
        print(para.text)
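The heading check can be extended to keep the heading level as well, which gives a simple outline of the document. The sketch below is one possible way to do this, assuming the style names follow the "Heading 1", "Heading 2" pattern; the function name heading_outline is hypothetical.

```python
def heading_outline(paragraphs):
    """Return a list of (level, text) tuples for heading paragraphs.

    `level` is the integer after "Heading" in the style name, or None
    when the style name has no numeric suffix.
    """
    outline = []
    for para in paragraphs:
        name = para.style.name
        if not name.startswith("Heading"):
            continue
        # "Heading 2" -> "2"; a bare "Heading" leaves an empty suffix.
        suffix = name[len("Heading"):].strip()
        level = int(suffix) if suffix.isdigit() else None
        outline.append((level, para.text))
    return outline


def print_outline(path):
    """Hypothetical wrapper: print an indented outline for a .docx file."""
    # Imported here so heading_outline can be used without python-docx installed.
    import docx

    for level, text in heading_outline(docx.Document(path).paragraphs):
        indent = "  " * ((level or 1) - 1)
        print(f"{indent}{text}")
```

Calling print_outline("script.docx") would print each heading indented by two spaces per level below the first.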