---- Regular Expressions and Awk ----

Problem 1: Write regular expressions that match input lines containing:

a) at least one single non-negative integer
b) 3 non-negative integers separated by whitespace (blanks and tabs)
c) an alphanumeric word terminating with a colon
d) a match for pattern c) followed by a match for pattern b)

Test your expressions above with shell lines of this type:

echo 'abababab' | egrep -o 'ab'
echo 'abababab' | gawk '/ab/'

Make up both examples and counter-examples. (One possible set of
patterns is sketched at the end of this handout.)

Problem 2: Bash/sed/egrep/gawk rehearsal.

What do the following shell lines do? Describe in detail why they
work the way they work.

echo alsdkjf
echo alsdkjf | echo `cat`
for i in a l s d k j f; do echo -n $i; done; echo
set a l s d k j f; for i in "${@}"; do echo -n ${i}; done; echo
echo alsdkjf | (VAR=`cat`; echo $VAR)
echo alsdkjf | (VAR=`cat`; echo $VAR > foo$$; cat foo$$;)
echo alsdkjf | sed 's/a/b/; s/b/a/'
echo alsdkjf | sed '/a*s/q'
echo alsdkjf | sed -n '/als/p'
echo alsdkjf | egrep ''
echo alsdkjf | egrep '(^a)*(f$)'
echo alsdkjf | egrep '^a.*f$'
echo alsdkjf | egrep '[A-z][0-z][M-t]d.[a-z]+f'
echo alsdkjf | tr [xyz] [abc]
echo alsdkjf | gawk '{print}'
echo alsdkjf | gawk --posix '/.{7}/'
echo alsdkjf | gawk --posix '/.{5,100}/'

Problem 3: Analyzing a book

Reformat the book so there is one sentence per line. Then get word
counts for the sentences, and finally construct a list of word counts.
(Illustrative command sketches for the individual steps appear at the
end of this handout.)

# Secure a copy of less.txt (foo0).
cp less.txt foo0

# I noticed late that there are typos in the book, one of them
# relevant for us: isolated periods, such as "Bahnhof ." It would
# help if these blanks were removed, e.g., with gawk:
gawk '{ gsub(" [.]","."); print; }' less.txt > foo0

# Remove lead paragraphs and chapter headings starting with "*"
# (foo0->foo1). The lead paragraphs are those that precede the first
# chapter heading.

# Replace all newlines '\n' with blanks ' ' (foo1->foo2).
# This strings the book out into one looooong line,
# which isn't even a line because it's not terminated by '\n'.
# Note, though, that the words are still separated by blanks.

# Break foo2 into words and append '\n' if there is [.:!?] in the word
# in the proper position (foo2->foo3).
# Idea: Use RS=" " in gawk. This breaks the file into records that
# contain exactly one word each, possibly with punctuation attached.
# You can then look at each record/word ($0) and determine whether it
# constitutes the end of a sentence, based on the punctuation pattern
# it contains.
# Caveat: It's not possible to capture all possibilities of sentence
# endings. For example: there is a sentence that ends on "e.g.",
# which I thought would be un-English, but there it is. Another
# problem: "Ph.D." should NOT be recognized as a sentence end. In
# these two cases there is probably no automated solution. It is
# semantic context that determines the sentence end after "e.g." and
# "Ph.D.", not syntactically apparent properties.

# Cosmetics: Remove leading blanks from every line (foo3->foo4).
# Use gawk.

# Collect the word counts of every sentence, one count per line
# (foo4->foo5). Use gawk.

# Create a list of the frequency of each word count:
# How many sentences have length 1,2,3,...? (foo5->foo6)
# Here I want you to find out yourself what tools to use.
# Studying the class notes would definitely help.

# Print word count and sentence together on a line (foo4->foo7).
# The point is to tag the sentences with their word counts
# so we can find the long sentences, for example.
# Use gawk.

# Comment on the long sentences.
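---- Possible sketches ----

For Problem 1, one possible set of egrep patterns. This is a sketch,
not the only correct answer; it assumes a "non-negative integer" is a
plain run of digits. The last line is a counter-example: it produces
no output because only two integers are present.

echo 'x 42 y'       | egrep '[0-9]+'                                       # a)
echo '1  22 333'    | egrep '[0-9]+[[:space:]]+[0-9]+[[:space:]]+[0-9]+'   # b)
echo 'Total:'       | egrep '[A-Za-z0-9]+:'                                # c)
echo 'Total: 1 2 3' | egrep '[A-Za-z0-9]+:.*[0-9]+[[:space:]]+[0-9]+[[:space:]]+[0-9]+'  # d)
echo '1 2'          | egrep '[0-9]+[[:space:]]+[0-9]+[[:space:]]+[0-9]+'   # counter-example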
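For the foo0->foo1 step, a sketch assuming the chapter headings are
exactly the lines whose first character is "*":

gawk '/^\*/ { seen = 1; next }   # drop heading lines, note that one occurred
      seen  { print }            # print only text after the first heading
     ' foo0 > foo1

Lead paragraphs are dropped automatically: before the first heading,
seen is still 0, so nothing is printed.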
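For foo1->foo2, tr is the simplest tool. It replaces every newline,
including the final one, which is why the result is one long
unterminated line:

tr '\n' ' ' < foo1 > foo2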
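For foo2->foo3, a sketch of the RS=" " idea. The sentence-end test is
deliberately simple; as the caveat above says, it will misfire on
"e.g." and "Ph.D.", and it also misses punctuation followed by a
closing quote:

gawk 'BEGIN { RS = " " }
      $0 == ""  { next }            # consecutive blanks yield empty records
      /[.:!?]$/ { print $0; next }  # word ends a sentence: terminate the line
      { printf "%s ", $0 }          # ordinary word: stay on the current line
     ' foo2 > foo3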
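For foo3->foo4, stripping leading blanks in gawk:

gawk '{ sub(/^ +/, ""); print }' foo3 > foo4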
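For foo4->foo5, gawk's NF is exactly the word count of a
whitespace-separated line:

gawk '{ print NF }' foo4 > foo5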
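For foo5->foo6, one classic pipeline (a gawk array would work just as
well); this is the part you are asked to work out yourself, so treat
it as a hint:

sort -n foo5 | uniq -c > foo6    # column 1: number of sentences, column 2: their word count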
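For foo4->foo7, prefixing each sentence with its word count; a numeric
sort then surfaces the long sentences to comment on:

gawk '{ print NF, $0 }' foo4 > foo7
sort -rn foo7 | head -5          # the five longest sentences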