hugovk/gutengrep

gutengrep

Find whole sentences matching a regex in Project Gutenberg plain text files.

Example commands

gutengrep.py "^[^\w]*And then" "*.txt" --cache --sort --correct -o output/and-then.txt

gutengrep.py "^[^\w]*But why" "*.txt" --cache --sort --correct -o output/but-why.txt

gutengrep.py -i "whale" moby11.txt --sort --correct -o out\mobydick-whale.txt
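The pattern `^[^\w]*And then` in the first command allows any run of non-word characters (quotes, dashes, spaces) before the words "And then" at the start of a sentence. A minimal sketch of that behaviour using Python's `re` module is shown here; the sample sentences are made up, and this is not gutengrep's own code:

```python
import re

# The pattern from the example command: optional leading non-word
# characters (quotes, dashes, spaces), then the literal words "And then".
pattern = re.compile(r"^[^\w]*And then")

# Hypothetical sentences, invented for illustration.
sentences = [
    '"And then what happened?"',
    "And then they left.",
    "He paused, and then he spoke.",
    "--And then, silence.",
]

# Keep only sentences that start with "And then", ignoring leading punctuation.
matches = [s for s in sentences if pattern.search(s)]
for s in matches:
    print(s)
```

The third sentence is skipped because the pattern is anchored to the start of the sentence and is case-sensitive; the `-i` flag in the third command presumably lifts the case restriction.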

Example output

| Name | Sorted | Regex | Input | Word count |
|---|---|---|---|---|
| But why? | But why? | ^[^\w]*But why | *.txt | 7,572 |
| And then! | And then! | [^\w]*And then | *.txt | 85,014 |
| The whale | The whale | whale | moby11.txt | 50,913 |
| Why | Why | [^\w]*Why | *.txt | 184,832 |
| Once upon a time | Once upon a time | -i once upon a time | *.txt | 6,195 |
| The End | The End | -i the end\. | *.txt | 142,94 |
| Happily ever after | Happily ever after | -i happily ever after | *.txt | 271 |
| Moonlit | Moonlit | -i moonlit | *.txt | 52,345 |
| Moonlight | Moonlight | -i moonlight | *.txt | 3,186 |

See also nanogenmo.md.

Tips

Download the Project Gutenberg August 2003 CD (download and mount the ISO file), then copy all the text files from its 'etext' directories into a single directory on your hard drive.
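The copy step can be scripted. The sketch below is one way to do it, assuming the mounted CD has directories whose names start with `etext` and that the books end in lowercase `.txt`; the function name `collect_texts` and both paths are hypothetical:

```python
import shutil
from pathlib import Path


def collect_texts(cd_root, dest):
    """Copy every .txt file found below an 'etext*' directory into dest."""
    root = Path(cd_root)
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for src in sorted(root.rglob("*.txt")):
        # Directory names between the CD root and the file itself.
        parents = src.relative_to(root).parts[:-1]
        if not any(name.startswith("etext") for name in parents):
            continue
        # Files that share a name overwrite each other in dest.
        shutil.copy2(src, dest / src.name)
        copied += 1
    return copied
```

For example, `collect_texts("/Volumes/PG2003", "texts")` would flatten the tree into `texts/`, where the mount point is again an assumption.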

When working on the whole corpus, use --cache to cut down on file operations. The first run builds a cache file of all tokenised sentences; on my MBP this pass takes about 5 minutes to go through the 597 books of the Project Gutenberg CD and extract their 3,583,390 sentences. Subsequent runs using the cache take about 40 seconds.

When searching just a single file, or a subset of files, do not use --cache: it will search the cache file generated from the initial file spec rather than the files you name.
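To illustrate why the cache is fast and why it ignores a changed file spec, here is a minimal sketch of a tokenise-once, reuse-later cache. It is not gutengrep's actual implementation: the naive regex splitter, the pickle format, and the names `split_sentences` and `load_sentences` are all assumptions made for the example.

```python
import glob
import pickle
import re
from pathlib import Path

# Naive splitter, an assumption for illustration: break after ., ! or ?
# followed by whitespace. A real tokeniser handles abbreviations better.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    # Collapse line breaks first, since plain text files wrap mid-sentence.
    flat = " ".join(text.split())
    return [s for s in SENTENCE_END.split(flat) if s]


def load_sentences(filespec, cache_path=None):
    """Return all sentences for filespec, reusing cache_path if it exists."""
    if cache_path and Path(cache_path).exists():
        # The cache is returned as-is, whatever filespec says.
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    sentences = []
    for name in sorted(glob.glob(filespec)):
        text = Path(name).read_text(encoding="utf-8", errors="replace")
        sentences.extend(split_sentences(text))
    if cache_path:
        with open(cache_path, "wb") as f:
            pickle.dump(sentences, f)
    return sentences
```

In this sketch, a second call with a different `filespec` but the same `cache_path` still returns the sentences from the first call, which is the behaviour the tip above warns about.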
