---- Regular Expressions and Awk ----

Problem 1: Write regular expressions that match input lines containing:

a) at least one single non-negative integer
b) 3 non-negative integers separated by whitespace (blanks and tabs)
c) an alphanumeric word terminating with a colon
d) a match for pattern c) followed by a match for pattern b)

Test your expressions above with shell lines of this type:

echo 'abababab' | egrep -o 'ab'
echo 'abababab' | gawk '/ab/'

Make up both examples and counter-examples. (One possible set of
patterns is sketched at the end of this handout.)

Problem 2: Bash/sed/egrep/gawk rehearsal.

What do the following shell lines do? Describe in detail why they
work the way they work.

echo alsdkjf
echo alsdkjf | echo `cat`
for i in a l s d k j f; do echo -n $i; done; echo
set a l s d k j f; for i in "${@}"; do echo -n ${i}; done; echo
echo alsdkjf | (VAR=`cat`; echo $VAR)
echo alsdkjf | (VAR=`cat`; echo $VAR > foo$$; cat foo$$;)
echo alsdkjf | sed 's/a/b/; s/b/a/'
echo alsdkjf | sed '/a*s/q'
echo alsdkjf | sed -n '/als/p'
echo alsdkjf | egrep ''
echo alsdkjf | egrep '(^a)*(f$)'
echo alsdkjf | egrep '^a.*f$'
echo alsdkjf | egrep '[A-z][0-z][M-t]d.[a-z]+f'
echo alsdkjf | tr [xyz] [abc]
echo alsdkjf | gawk '{print}'
echo alsdkjf | gawk --posix '/.{7}/'
echo alsdkjf | gawk --posix '/.{5,100}/'

Problem 3: Analyzing a book

Reformat the book so there is one sentence per line. Then get word
counts for the sentences, and finally construct a list of word counts.
(Illustrative command sketches for the individual steps appear at the
end of this handout.)

# Secure a copy of less.txt (foo0).
cp less.txt foo0

# I noticed late that there are typos in the book, one of them
# relevant for us: isolated periods, such as "Bahnhof ." It would
# help if these blanks were removed, e.g., with gawk:
gawk '{ gsub(" [.]","."); print; }' less.txt > foo0

# Remove lead paragraphs and chapter headings starting with "*"
# (foo0->foo1). The lead paragraphs are those that precede the first
# chapter heading.

# Replace all newlines '\n' with blanks ' ' (foo1->foo2).
# This strings the book out into one looooong line,
# which isn't even a line because it's not terminated by '\n'.
# Note, though, that the words are still separated by blanks.

# Break foo2 into words and append '\n' if there is [.:!?] in the word
# in the proper position (foo2->foo3).
# Idea: Use RS=" " in gawk. This breaks the file into records that
# contain exactly one word each, possibly with punctuation attached.
# You can then look at each record/word ($0) and determine whether it
# constitutes the end of a sentence, based on the punctuation pattern
# it contains.
# Caveat: It's not possible to capture all possibilities of sentence
# endings. For example: there is a sentence that ends on "e.g.",
# which I thought would be un-English, but there it is. Another
# problem: "Ph.D." should NOT be recognized as a sentence end. In
# these two cases there is probably no automated solution. It is
# semantic context that determines the sentence end after "e.g." and
# "Ph.D.", not syntactically apparent properties.

# Cosmetics: Remove leading blanks from every line (foo3->foo4).
# Use gawk.

# Collect the word counts of every sentence, one count per line
# (foo4->foo5). Use gawk.

# Create a list of the frequency of each word count:
# How many sentences have length 1,2,3,...? (foo5->foo6)
# Here I want you to find out yourself what tools to use.
# Studying the class notes would definitely help.

# Print word count and sentence together on a line (foo4->foo7).
# The point is to tag the sentences with their word counts
# so we can find the long sentences, for example.
# Use gawk.

# Comment on the long sentences.
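---- Possible sketches ----

For Problem 1, one possible set of egrep patterns. This is a sketch,
not the only correct answer; it assumes a "non-negative integer" is a
plain run of digits. The last line is a counter-example: it produces
no output because only two integers are present.

echo 'x 42 y'       | egrep '[0-9]+'                                       # a)
echo '1  22 333'    | egrep '[0-9]+[[:space:]]+[0-9]+[[:space:]]+[0-9]+'   # b)
echo 'Total:'       | egrep '[A-Za-z0-9]+:'                                # c)
echo 'Total: 1 2 3' | egrep '[A-Za-z0-9]+:.*[0-9]+[[:space:]]+[0-9]+[[:space:]]+[0-9]+'  # d)
echo '1 2'          | egrep '[0-9]+[[:space:]]+[0-9]+[[:space:]]+[0-9]+'   # counter-example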
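For the foo0->foo1 step, a sketch assuming the chapter headings are
exactly the lines whose first character is "*":

gawk '/^\*/ { seen = 1; next }   # drop heading lines, note that one occurred
      seen  { print }            # print only text after the first heading
     ' foo0 > foo1

Lead paragraphs are dropped automatically: before the first heading,
seen is still 0, so nothing is printed.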
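For foo1->foo2, tr is the simplest tool. It replaces every newline,
including the final one, which is why the result is one long
unterminated line:

tr '\n' ' ' < foo1 > foo2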
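For foo2->foo3, a sketch of the RS=" " idea. The sentence-end test is
deliberately simple; as the caveat above says, it will misfire on
"e.g." and "Ph.D.", and it also misses punctuation followed by a
closing quote:

gawk 'BEGIN { RS = " " }
      $0 == ""  { next }            # consecutive blanks yield empty records
      /[.:!?]$/ { print $0; next }  # word ends a sentence: terminate the line
      { printf "%s ", $0 }          # ordinary word: stay on the current line
     ' foo2 > foo3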
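For foo3->foo4, stripping leading blanks in gawk:

gawk '{ sub(/^ +/, ""); print }' foo3 > foo4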
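For foo4->foo5, gawk's NF is exactly the word count of a
whitespace-separated line:

gawk '{ print NF }' foo4 > foo5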
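For foo5->foo6, one classic pipeline (a gawk array would work just as
well); this is the part you are asked to work out yourself, so treat
it as a hint:

sort -n foo5 | uniq -c > foo6    # column 1: number of sentences, column 2: their word count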
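For foo4->foo7, prefixing each sentence with its word count; a numeric
sort then surfaces the long sentences to comment on:

gawk '{ print NF, $0 }' foo4 > foo7
sort -rn foo7 | head -5          # the five longest sentences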