This discusses about using Perl, how we can obtain a data file in a particular desired format, given a data file in a random format. By format, we refer to the order of columns in the input data file. Using this Perl script, we obtain output data file in a particular format and we get another file having discarded data lines, based on some filtering condition.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 | #!/usr/local/bin/perl
##############################################################################
# DESCRIPTION:
#
# This script takes data file, and extra attribute as inputs, and gives output
# data file in a particular format. It also addes extra attribute in beginning
# of each line in the output file.
#
# Additionally, if the number of fields in a line in the input file does not
# match with the number of pre-defined fields, then such lines are filtered
# and are stored in a discard output file.
#
# Version: 0.1 13-Feb-2007 Rajkumar Jain Initial Version
############################################################################
# Input file format is as following.
#
# ATTRIBUTE03|ATTRIBUTE01|ATTRIBUTE05|ATTRIBUTE02|ATTRIBUTE04|
# 01_attr_03|01_attr_01|01_attr_05|01_attr_02|01_attr_04|
# 02_attr_03|02_attr_01|02_attr_05|02_attr_02||
# 03_attr_03|03_attr_01|03_attr_05|03_attr_02|
# 04_attr_03|04_attr_01|04_attr_05|
# 05_attr_03|05_attr_01|05_attr_05|||05_attr_04|
# 06_attr_03|06_attr_01|06_attr_05|06_attr_02|06_attr_04|
#
$num_args = @ARGV;
# If wrong number of arguments are passed to Perl script, then exit
if ($num_args != 2)
{
print "Incorrect number of arguments to perl \n";
exit 1;
}
$inp_file=@ARGV[0];
$extra_attr=@ARGV[1];
# Array containing column names (headers) in pre-defined order
# This can be changed as needed
@col_names = ( "ATTRIBUTE01",
"ATTRIBUTE02",
"ATTRIBUTE03",
"ATTRIBUTE04",
"ATTRIBUTE05"
);
$file_line_count=1;
$flag = 0;
# Defining the output data file and discard data file names
$output_file=$inp_file."."."out";
$discard_file=$inp_file."."."discard";
# Defining File handles for input and output files
open (INPUT, "<$inp_file") || die "Can't open file for reading";
open (OUTPUT,">$output_file") || die "Can't open the output file for writing";
# Printing header names in the first line of output file
print OUTPUT "$extra_attr"."|";
foreach $column_name (@col_names) {
print OUTPUT "$column_name"."|";
}
print OUTPUT "\n";
# Loop to itertate through each line of input file
while(<INPUT>)
{
$elements_count = 0;
# This processing is for the first line (containing headers).
# We re-order the columns as per the order defined in array @col_names
if($file_line_count == 1)
{
@line_items = split(/\|/);
$header_count = @line_items;
$file_line_count=$file_line_count +1;
# Prepare a hash map between headercolumns in the input file
# and its index in the the col array
for ($i=0; $i<$header_count - 1;$i++)
{
for ($j=0; $j<=$#col_names ;$j++) {
if($line_items[$i] eq $col_names[$j] ) {
$HashOrder{$col_names[$j]} = $i;
}
}
}
# Assign "" to all those columns which dont contain values
for ($j=0; $j<=$#col_names ;$j++) {
if ($HashOrder{$col_names[$j]} != 0 && !($HashOrder{$col_names[$j]}) ) {
$HashOrder{$col_names[$j]} = "";
}
}
next;
}
#Get the current line
$cur_file_line=$_;
# Remove the spaces from beginning and end of the current line
# This is similar to trimming the line
$cur_file_line =~ s/^\s//;
$cur_file_line =~ s/\s$//;
$file_line="";
if ($cur_file_line ne "")
{
@line_items = split(/\|/);
$elements_count = @line_items;
# If the number of values in the line matches with the number of headers,
# then we store the line in the output file
if($elements_count == $header_count)
{
#rearrange according to the column positions
for ($j=0; $j<=$#col_names ;$j++) {
if($HashOrder{$col_names[$j]} ne "")
{
$file_line = $file_line.$line_items[$HashOrder{$col_names[$j]}];
}
$file_line = $file_line."|";
}
# $extra_attr is prefixed in every line and then line
# is printed in output file
$file_line = $extra_attr."|".$file_line."\n";
print OUTPUT "$file_line";
}
# If the number of values in the line does not match with the number of headers,
# then we store the line in the discard output file
else
{
# The message below (The following.... ) is printed only once in
# the discard output file. This is taken care of by value of $flag.
if ($flag == 0)
{
open (DISCARD,">$discard_file") || die "Can't open the discard file for writing";
print DISCARD "The following lines failed the field count validation \n";
$flag =1;
}
# Printing the line of discard data in discard output file
print DISCARD "$cur_file_line"."\n";
}
} #End of ($cur_file_line ne "")
} #End of While
# Closing file handles
close(OUTPUT);
if($flag == 1)
{
close(DISCARD);
}
close(INPUT);
|
This will be useful, especially when we need data in a particular format in the data file, so as to store that data in database further. This is because once we obtain data file in a particular format, the logic/code for processing the data from data file for storing in database need not be changed every time. This Perl script takes care of converting input data file to a data file in a particular pre-defined format.
Additionally, the records in the input data file can be filtered out based on some condition. For example, here we discard the lines if the number of fields in a line does not match the pre-defined number of fields. These discarded records can then be kept in another discard output data file.
Hi I am trying to find out how to know what each one of the modules does in active or how to find out what they do. I dont want to add a ppm if I all ready have one that is stock and it does what I want.