PHP: Processing BBC Daily Email

If you receive the BBC daily emails too, you will have noticed that they are fairly redundant. A news item that appears in the "TOP STORIES" category, for instance, can also appear in the "AFRICA" category. In fact, it's not unusual to find the same item in two or three categories.
Here's a PHP script that accepts one daily BBC news email and outputs a page in which each news item is listed just once.
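The de-duplication idea on its own can be sketched in a few lines. This is only an illustration: the sample items are made up, and the sketch keeps seen titles as array keys, whereas the script further down checks a plain list with in_array.

```php
<?php
// Hypothetical sample data: (section, title) pairs as they might appear in one email.
$items = array(
    array('TOP STORIES', 'Example headline A'),
    array('TOP STORIES', 'Example headline B'),
    array('AFRICA', 'Example headline A'),   // repeat of the first item
);

$seen = array();     // titles already listed, used as a set (title => true)
$unique = array();   // titles in first-seen order

foreach ($items as $item) {
    list($section, $title) = $item;
    if (isset($seen[$title])) { continue; }   // already listed: skip it
    $seen[$title] = true;
    $unique[] = $title;
}

echo implode("\n", $unique) . "\n";
```

Looking titles up by key avoids rescanning the whole list for every item, although for one email's worth of headlines either approach is fast enough.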

To use this script, open it in a browser, paste an entire BBC daily email into the form's text area, and press the submit button.

This script uses the Finite State Machine class that is available at http://pear.php.net/package/FSM/docs/latest/FSM/FSM.html.

 

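<?php
require_once 'FSM.php';

// Expose the FSM's payload so that the calling code can reset it and read it.
class FSM_1 extends FSM {
    function setPayload ( $payload ) { $this->_payload = $payload; }
    function getPayload ( ) { return $this -> _payload; }
}

// Action callbacks: each one tags the current line by writing to the payload.
// $payload is taken by reference so that the assignment reaches the FSM;
// from PHP 5.4 on, a by-value parameter would only change a local copy.
function SectionName ( $symbol, &$payload ) { $payload = "Section:"; }
function ItemTitle ( $symbol, &$payload ) { $payload = "Title:"; }
function ItemSummary ( $symbol, &$payload ) { $payload = "Summary:"; }
function ItemURL ( $symbol, &$payload ) { $payload = "URL:" ; }

$stack = array ( );
$f = new FSM_1 ( 'WAITING', $stack );

$f -> setDefaultTransition ( null, 'WAITING' );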
$f -> addTransition ( '.', 'WAITING', 'SECTION_NAME', null );
$f -> addTransitionAny  ( 'SECTION_NAME', 'DELIM_OR_ITEM_TITLE', 'SectionName' );
$f -> addTransition ( '*', 'DELIM_OR_ITEM_TITLE', 'ITEM_SUMMARY', 'ItemTitle' );
$f -> addTransition ( '.', 'DELIM_OR_ITEM_TITLE', 'SECTION_NAME', null );
$f -> addTransitionAny ( 'DELIM_OR_ITEM_TITLE', 'ITEM_TITLE', 'SectionName' );
$f -> addTransitionAny ( 'ITEM_SUMMARY', 'ITEM_Ignore', 'ItemSummary' );
$f -> addTransitionAny ( 'ITEM_TITLE', 'ITEM_Ignore', null );
$f -> addTransitionAny ( 'ITEM_Ignore', 'ITEM_URL', null );
$f -> addTransitionAny ( 'ITEM_URL', 'DELIM_OR_ITEM_TITLE', 'ItemURL' );

// Only process input once the form has been submitted.
if ( isset ( $_POST [ 'submitid' ] ) and $_POST [ 'submitid' ] == 1 ) {
    $titles = array ( );
    $ignore = 0;
    foreach ( explode ( "\n", $_POST [ 'news' ] ) as $line ) {
        $line = trim ( $line ) ;
        if ( $line == '' ) { continue; }
        $f -> setPayload ( '' );

        $f -> process ( substr ( $line, 0, 1 ) );
        $payload = $f -> getPayload ( );
        if ( $payload != '' ) {
            if ( $payload == 'Section:' and $line == 'OPTIONS AND HELP' ) { break; }
            if ( $payload == 'Title:' ) {
                if ( in_array ( $line, $titles ) ) { $ignore = 1; }
                else {
                    $ignore = 0;
                    array_push ( $titles, $line );
                }
            }
            if ( $ignore == 0 ) {
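                // Escape the pasted text before echoing it into the page.
                if ( $payload == 'Title:' ) { echo htmlspecialchars ( stripslashes ( $line ) ) . "<br/>\n"; }
                if ( $payload == 'Summary:' ) { echo htmlspecialchars ( stripslashes ( $line ) ) . "<br/>\n"; }
                if ( $payload == 'URL:') { echo "<a href='" . htmlspecialchars ( $line, ENT_QUOTES ) . "' target='_blank'>" . htmlspecialchars ( $line ) . "</a><p/>\n"; }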
            }
        }
    }
    echo "<p/>\n";
}
?>
<form action="<?php echo htmlspecialchars ( $_SERVER['PHP_SELF'] ); ?>" method="post" name="adminlogin" id="adminlogin" style="display:inline;">
    <textarea name="news" id="news" rows="20" cols="200"></textarea><br/>
    <input name="Submit" type="submit" id="Submit" value="Submit">
    <input name="submitid" type="hidden" id="submitid" value="1"/>
</form>

Here are some notes about how it works:

Fortunately the lines in the BBC emails are sequenced in a particularly simple way, and this is reflected in the collection of transition rules.
When a transition rule results in a call to a function such as SectionName, indicating that a section name (such as "TOP STORIES") has been recognised, the function signals this fact by setting $payload accordingly. I have extended the base class so that the $payload that is made available to these functions is also available to the code that instantiates the extended class.
The result is that each time a non-blank line is read from the daily email, its first character is passed to the (extended) FSM as the input symbol, and the line is then processed according to the payload that comes back.
One minor point: the "cols" setting for the textarea in the form may seem excessively large. It is set this way so that lines from the email are not folded, which would make parsing unnecessarily complicated.
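To make the notes above concrete, here is a minimal sketch that traces a few lines through a subset of the same transition rules. TinyFSM is a hypothetical stand-in written only for this illustration, not the PEAR FSM class: its "actions" are plain tag strings rather than callbacks. The sample lines are invented to fit the rules and are not copied from a real BBC email.

```php
<?php
// TinyFSM: a hypothetical stand-in for the PEAR FSM class (illustration only).
class TinyFSM {
    public $state;
    public $payload = '';
    private $exact = array();   // "symbol|state" => array(next state, tag)
    private $any = array();     // state => array(next state, tag)

    function __construct($initialState) { $this->state = $initialState; }

    function addTransition($symbol, $state, $next, $tag) {
        $this->exact["$symbol|$state"] = array($next, $tag);
    }
    function addTransitionAny($state, $next, $tag) {
        $this->any[$state] = array($next, $tag);
    }
    // Try an exact (symbol, state) rule first, then the state's catch-all rule.
    // With neither, stay in the current state.
    function process($symbol) {
        $key = "$symbol|{$this->state}";
        if (isset($this->exact[$key])) { $t = $this->exact[$key]; }
        elseif (isset($this->any[$this->state])) { $t = $this->any[$this->state]; }
        else { return; }
        $this->state = $t[0];
        if ($t[1] !== null) { $this->payload = $t[1]; }
    }
}

// A subset of the article's rules, with tags in place of the action functions.
$f = new TinyFSM('WAITING');
$f->addTransition('.', 'WAITING', 'SECTION_NAME', null);
$f->addTransitionAny('SECTION_NAME', 'DELIM_OR_ITEM_TITLE', 'Section:');
$f->addTransition('*', 'DELIM_OR_ITEM_TITLE', 'ITEM_SUMMARY', 'Title:');
$f->addTransition('.', 'DELIM_OR_ITEM_TITLE', 'SECTION_NAME', null);
$f->addTransitionAny('ITEM_SUMMARY', 'ITEM_Ignore', 'Summary:');
$f->addTransitionAny('ITEM_Ignore', 'ITEM_URL', null);
$f->addTransitionAny('ITEM_URL', 'DELIM_OR_ITEM_TITLE', 'URL:');

// Invented sample lines (assumption: a delimiter row, a section name, then one item).
$lines = array(
    '..........',
    'TOP STORIES',
    '* Example headline',
    'One-line summary of the story.',
    'Full story:',                    // consumed by the ITEM_Ignore rule
    'http://example.com/story',
);

$tagged = array();
foreach ($lines as $line) {
    $f->payload = '';                      // reset, as the script does with setPayload
    $f->process(substr($line, 0, 1));      // only the first character is the symbol
    if ($f->payload != '') { $tagged[] = $f->payload . ' ' . $line; }
}

echo implode("\n", $tagged) . "\n";
```

The delimiter row and the "Full story:" line move the machine to a new state without producing a tag, which is why only four of the six lines end up in $tagged.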

 
