gutengrep

Find whole sentences matching a regex in Project Gutenberg plain text files.

Example commands

gutengrep.py "^[^\w]*And then" "*.txt" --cache --sort --correct -o output/and-then.txt

gutengrep.py "^[^\w]*But why" "*.txt" --cache --sort --correct -o output/but-why.txt

gutengrep.py -i "whale" moby11.txt --sort --correct -o out\mobydick-whale.txt
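
The patterns above can be tried out in isolation. What follows is a minimal sketch, not gutengrep's actual code: it assumes each sentence is tested on its own with `re.search`, that `-i` corresponds to `re.IGNORECASE`, and it uses a naive regex splitter (`split_sentences`, a hypothetical helper) in place of a real sentence tokeniser.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    # Collapse line breaks first, since Gutenberg texts are hard-wrapped.
    flat = " ".join(text.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def find_sentences(pattern, text, ignore_case=False):
    """Return the whole sentences in which the pattern is found."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [s for s in split_sentences(text) if regex.search(s)]


sample = (
    'He waited. "And then what?" she asked. '
    "We ate, and then we slept. And then\nit rained."
)

# ^[^\w]* allows leading quotes or dashes before the first word,
# so a sentence that opens with a quotation mark still counts.
print(find_sentences(r"^[^\w]*And then", sample))
```

With this sample, the quoted sentence and the final sentence are returned, while "We ate, and then we slept." is skipped because the match is anchored to the start of the sentence and is case-sensitive.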

Example output

Name	Sorted	Regex	Input	Word count
But why?	But why?	`^[^\w]*But why`	`*.txt`	7,572
And then!	And then!	`[^\w]*And then`	`*.txt`	85,014
The whale	The whale	`whale`	`moby11.txt`	50,913
Why	Why	`[^\w]*Why`	`*.txt`	184,832
Once upon a time	Once upon a time	`-i` `once upon a time`	`*.txt`	6,195
The End	The End	`-i` `the end\.`	`*.txt`	142,94
Happily ever after	Happily ever after	`-i` `happily ever after`	`*.txt`	271
Moonlit	Moonlit	`-i` `moonlit`	`*.txt`	52,345
Moonlight	Moonlight	`-i` `moonlight`	`*.txt`	3,186

Tips

Download the ISO file of the Project Gutenberg August 2003 CD, mount it, and copy all the text files from its 'etext' directories into a single directory on your hard drive.

When working on the whole corpus, use --cache to cut down on file operations. The first run builds a cache file of all tokenised sentences; this pass takes about 5 minutes on my MBP to go through the 597 books of the Project Gutenberg CD and extract their 3,583,390 sentences. Subsequent runs using the cache take about 40 seconds.
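
The caching idea can be sketched as follows. This is an illustration under stated assumptions rather than gutengrep's implementation: the cache file name, the use of `pickle`, and the `tokenise` helper are all hypothetical.

```python
import glob
import os
import pickle
import re

CACHE_FILE = "sentences.pickle"  # hypothetical name, not gutengrep's


def tokenise(text):
    """Stand-in sentence tokeniser (assumption): split after ., ! or ?."""
    flat = " ".join(text.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def load_sentences(filespec, use_cache=False, cache_file=CACHE_FILE):
    """Return all sentences for filespec, reusing a cache file if asked."""
    if use_cache and os.path.exists(cache_file):
        # Fast path: skip reading and tokenising every book again.
        # filespec is not consulted here; the cache is returned as-is.
        with open(cache_file, "rb") as f:
            return pickle.load(f)

    # Slow path: read and tokenise every file matching the spec.
    sentences = []
    for path in sorted(glob.glob(filespec)):
        with open(path, encoding="utf-8", errors="replace") as f:
            sentences.extend(tokenise(f.read()))

    if use_cache:
        with open(cache_file, "wb") as f:
            pickle.dump(sentences, f)
    return sentences
```

In this sketch, a second call with `use_cache=True` returns the stored list without touching the text files, whatever file spec is passed.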

When searching a single file or a subset of files, do not use --cache: it would reuse the cache file generated from the initial file spec instead of reading the files you named.
