Monthly Archives: April 2017

Software Bugs

A software bug is an error, flaw, failure or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways.

Most bugs arise from mistakes and errors made in either a program’s source code or its design, or in components and operating systems used by such programs. A few are caused by compilers producing incorrect code. A program that contains a large number of bugs, and/or bugs that seriously interfere with its functionality, is said to be buggy (defective). Bugs trigger errors that may have ripple effects. Bugs may have subtle effects or cause the program to crash or freeze the computer. Others qualify as security bugs and might, for example, enable a malicious user to bypass access controls in order to obtain unauthorized privileges.

The software industry has put much effort into reducing bug counts. These efforts include the following:

Typographical errors

Bugs usually appear when the programmer makes a logic error. Various innovations in programming style and defensive programming are designed to make these bugs less likely, or easier to spot. Some typos, especially of symbols or logical/mathematical operators, allow the program to operate incorrectly, while others such as a missing symbol or misspelled name may prevent the program from operating. Compiled languages can reveal some typos when the source code is compiled.
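
As a minimal illustration (Python here, chosen only as an example language, with made-up function names): an operator typo lets the program run but return the wrong answer, while a misspelled name stops it from operating at all.

    # Hypothetical sketch: an operator typo versus a misspelled name.
    def is_adult(age):
        # Intended: age >= 18. The typo ">" silently misclassifies 18-year-olds,
        # so the program still runs but produces an incorrect result.
        return age > 18

    print(is_adult(18))        # False, although True was intended

    def greet(name):
        # The misspelled identifier "nane" is only caught when the function runs,
        # raising a NameError and preventing normal operation.
        return "Hello, " + nane

    try:
        greet("Ada")
    except NameError as error:
        print(error)           # name 'nane' is not defined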

Development methodologies

Several schemes assist in managing programmer activity so that fewer bugs are produced. Software engineering (which addresses software design issues as well) applies many techniques to prevent defects. For example, formal program specifications state the exact behavior of programs so that design bugs may be eliminated. Unfortunately, formal specifications are impractical for anything but the shortest programs, because of problems of combinatorial explosion and indeterminacy.

Unit testing involves writing a test for every function (unit) that a program is to perform. In test-driven development unit tests are written before the code and the code is not considered complete until all tests complete successfully. Agile software development involves frequent software releases with relatively small changes. Defects are revealed by user feedback.
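
A minimal test-first sketch, using Python's built-in unittest module and a made-up leap_year function: in test-driven development the tests below would be written first, fail, and then drive the implementation.

    import unittest

    def leap_year(year):
        # Implementation written to make the tests below pass.
        return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

    class TestLeapYear(unittest.TestCase):
        def test_typical_years(self):
            self.assertTrue(leap_year(2016))
            self.assertFalse(leap_year(2017))

        def test_century_rules(self):
            self.assertFalse(leap_year(1900))
            self.assertTrue(leap_year(2000))

    if __name__ == "__main__":
        unittest.main()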

Programming language support

Programming languages include features to help prevent bugs, such as static type systems, restricted namespaces and modular programming. For example, when a programmer writes (pseudocode) LET REAL_VALUE PI = "THREE AND A BIT", although this may be syntactically correct, the code fails a type check. Compiled languages catch this without having to run the program; interpreted languages catch such errors only at runtime. Some languages deliberately exclude features that easily lead to bugs, at the expense of slower performance: the general principle being that it is almost always better to write simpler, slower code than inscrutable code that runs slightly faster, especially considering that maintenance cost is substantial. For example, the Java programming language does not support pointer arithmetic; implementations of some languages such as Pascal and scripting languages often have runtime bounds checking of arrays, at least in a debugging build.
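
A rough Python analogue of that pseudocode, assuming a static checker such as mypy is run over the file: the annotation declares PI a float, the string assignment fails the type check before the program runs, and the plain interpreter only fails later, at the point where the value is used as a number.

    PI: float = "THREE AND A BIT"   # static type check: incompatible assignment

    circumference = 2.5 * PI        # at runtime: TypeError (a str cannot be multiplied by a float)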

Code analysis

Tools for code analysis help developers by inspecting the program text beyond the compiler’s capabilities to spot potential problems. Although in general the problem of finding all programming errors given a specification is not solvable (see halting problem), these tools exploit the fact that human programmers tend to make certain kinds of simple mistakes often when writing software.
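
Two of the simple, common mistakes such tools look for, in a hypothetical Python snippet; linters such as pylint flag patterns of this kind without running the code.

    def add_item(item, items=[]):   # flagged: mutable default argument, shared across calls
        items.append(item)
        return items

    def check_threshold(value, limit):
        value > limit               # flagged: comparison result is computed but never used
        return True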

Instrumentation

Tools to monitor the performance of the software as it is running, either specifically to find problems such as bottlenecks or to give assurance as to correct working, may be embedded in the code explicitly (perhaps as simple as a statement saying PRINT "I AM HERE"), or provided as tools. It is often a surprise to find where most of the time is taken by a piece of code, and this removal of assumptions might cause the code to be rewritten.
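
A minimal instrumentation sketch in Python (illustrative names only): the modern equivalent of PRINT "I AM HERE", plus a simple timer to find out where the time is actually spent.

    import time

    def slow_sum(n):
        print("entering slow_sum")               # "I am here" style trace message
        start = time.perf_counter()
        total = sum(i * i for i in range(n))
        elapsed = time.perf_counter() - start
        print(f"slow_sum({n}) took {elapsed:.4f} s")   # where did the time go?
        return total

    slow_sum(1_000_000)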

Testing

Software testers are people whose primary task is to find bugs, or write code to support testing. On some projects, more resources may be spent on testing than in developing the program.

Measurements during testing can provide an estimate of the number of likely bugs remaining; this becomes more reliable the longer a product is tested and developed.

Debugging

The typical bug history (GNU Classpath project data): a new bug submitted by a user is unconfirmed; once it has been reproduced by a developer, it becomes a confirmed bug; confirmed bugs are later fixed. Bugs belonging to other categories (unreproducible, will not be fixed, etc.) are usually in the minority.

Finding and fixing bugs, or debugging, is a major part of computer programming. Maurice Wilkes, an early computing pioneer, described his realization in the late 1940s that much of the rest of his life would be spent finding mistakes in his own programs.

Usually, the most difficult part of debugging is finding the bug. Once it is found, correcting it is usually relatively easy. Programs known as debuggers help programmers locate bugs by executing code line by line, watching variable values, and other features to observe program behavior. Without a debugger, code may be added so that messages or values may be written to a console or to a window or log file to trace program execution or show values. More typically, the first step in locating a bug is to reproduce it reliably. Once the bug is reproducible, the programmer may use a debugger or other tool while reproducing the error to find the point at which the program went astray.
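
A small sketch of that trace-message approach, assuming Python and a made-up average function: values are written to a log file while the bug is reproduced, so the point where the program goes astray can be located. (A debugger could be attached instead, for example by calling breakpoint() at the suspect line.)

    import logging

    logging.basicConfig(filename="trace.log", level=logging.DEBUG,
                        format="%(asctime)s %(message)s")

    def average(values):
        logging.debug("average() called with %r", values)
        total = sum(values)
        logging.debug("total=%s count=%s", total, len(values))
        return total / len(values)   # fails for an empty list; the log shows how we got here

    average([3, 4, 5])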

Bug management

Bug management includes the process of documenting, categorizing, assigning, reproducing, correcting and releasing the corrected code. Proposed changes to software – bugs as well as enhancement requests and even entire releases – are commonly tracked and managed using bug tracking systems or issue tracking systems. The items added may be called defects, tickets, issues, or, following the agile development paradigm, stories and epics. Categories may be objective, subjective or a combination, such as version number, area of the software, severity and priority, as well as what type of issue it is, such as a feature request or a bug.
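
A hypothetical sketch of the kind of record such a tracking system keeps; the field names are illustrative, not any particular tracker's schema.

    from dataclasses import dataclass, field
    from enum import Enum

    class Severity(Enum):
        LOW = 1
        MEDIUM = 2
        HIGH = 3
        CRITICAL = 4

    @dataclass
    class Issue:
        title: str
        kind: str                  # "bug", "feature request", "story", ...
        version: str
        area: str
        severity: Severity
        priority: int              # lower number = addressed sooner
        status: str = "unconfirmed"
        history: list = field(default_factory=list)

        def transition(self, new_status):
            self.history.append((self.status, new_status))
            self.status = new_status

    ticket = Issue("Crash when saving an empty file", "bug", "2.1.0",
                   "file I/O", Severity.HIGH, priority=1)
    ticket.transition("confirmed")
    ticket.transition("fixed")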

 

Communications protocol

In telecommunications, a communication protocol is a system of rules that allows two or more entities of a communications system to transmit information via any kind of variation of a physical quantity. The protocol defines the rules, syntax, semantics and synchronization of communication and possible error recovery methods. Protocols may be implemented by hardware, software, or a combination of both.

Communicating systems use well-defined formats (protocol) for exchanging various messages. Each message has an exact meaning intended to elicit a response from a range of possible responses pre-determined for that particular situation. The specified behavior is typically independent of how it is to be implemented. Communications protocols have to be agreed upon by the parties involved. To reach agreement, a protocol may be developed into a technical standard. A programming language describes the same for computations, so there is a close analogy between protocols and programming languages: protocols are to communications what programming languages are to computations.
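
A minimal sketch of such a well-defined format, as a made-up length-prefixed protocol (not any real standard), in Python: the struct format string is the syntax, the agreed meaning of the type field is the semantics, and sending a request and waiting for its matching reply is the synchronization.

    import struct

    MSG_PING, MSG_PONG = 1, 2            # message types and their agreed meanings

    def encode(msg_type, payload: bytes) -> bytes:
        # Header: 1-byte type and 4-byte big-endian payload length, then the payload.
        return struct.pack("!BI", msg_type, len(payload)) + payload

    def decode(data: bytes):
        msg_type, length = struct.unpack("!BI", data[:5])
        return msg_type, data[5:5 + length]

    frame = encode(MSG_PING, b"hello")
    assert decode(frame) == (MSG_PING, b"hello")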

Multiple protocols often describe different aspects of a single communication. A group of protocols designed to work together is known as a protocol suite; when implemented in software they are a protocol stack.

Internet communication protocols are published by the Internet Engineering Task Force (IETF). The IEEE handles wired and wireless networking, and the International Organization for Standardization (ISO) handles other types. The ITU-T handles telecommunications protocols and formats for the public switched telephone network (PSTN). As the PSTN and Internet converge, the standards are also being driven towards convergence.

Protocols are to communications what algorithms or programming languages are to computations.

This analogy has important consequences for both the design and the development of protocols. One has to consider the fact that algorithms, programs and protocols are just different ways of describing the expected behavior of interacting objects. A familiar example of a protocolling language is HTML, the language used to describe web pages, which are the actual web documents.

In programming languages the association of identifiers to a value is termed a definition. Program text is structured using block constructs and definitions can be local to a block. The localized association of an identifier to a value established by a definition is termed a binding and the region of program text in which a binding is effective is known as its scope. The computational state is kept using two components: the environment, used as a record of identifier bindings, and the store, which is used as a record of the effects of assignments.
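
A short Python illustration of those terms (Python assumed only as the example language): the inner binding of x shadows the outer one within its scope, and the two list identifiers show the environment/store split, since both are bound to the same stored object.

    x = 10                # binding of x in the module (outer) scope

    def inner():
        x = 20            # a new local binding; it shadows the outer x only in this scope
        return x

    print(inner())        # 20 -- the local binding is in effect inside the function
    print(x)              # 10 -- the outer binding is untouched

    a = [1, 2, 3]
    b = a                 # two identifiers, one object: both bindings point into the store
    b.append(4)
    print(a)              # [1, 2, 3, 4] -- the assignment's effect is recorded in the store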

Most data center networks still manage their traffic with TCP. That’s not because TCP is perfect or because computer scientists have had trouble coming up with possible alternatives; it’s because those alternatives are too hard to test. The routers in data center networks have their traffic management protocols hardwired into them. Testing a new protocol means replacing the existing network hardware with either reconfigurable chips, which are labor-intensive to program, or software-controlled routers, which are so slow that they render large-scale testing impractical. A new emulation system addresses this: it maintains a compact, efficient computational model of a network running the new protocol, with virtual data packets that bounce around among virtual routers. On the basis of the model, it schedules transmissions on the real network to produce the same traffic patterns. Researchers can thus run real web applications on the network servers and get an accurate sense of how the new protocol would affect their performance.

Traffic control

Each packet of data sent over a computer network has two parts: the header and the payload. The payload contains the data the recipient is interested in — image data, audio data, text data, and so on. The header contains the sender’s address, the recipient’s address, and other information that routers and end users can use to manage transmissions. When multiple packets reach a router at the same time, they’re put into a queue and processed sequentially. With TCP, if the queue gets too long, subsequent packets are simply dropped; they never reach their recipients. When a sending computer realizes that its packets are being dropped, it cuts its transmission rate in half, then slowly ratchets it back up. A better protocol might enable a router to flip bits in packet headers to let end users know that the network is congested, so they can throttle back transmission rates before packets get dropped. Or it might assign different types of packets different priorities, and keep the transmission rates up as long as the high-priority traffic is still getting through. These are the types of strategies that computer scientists are interested in testing out on real networks.
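
A toy sketch of those two signals, with assumed numbers rather than a real router implementation: a fixed-length queue that either drops excess packets or flips a congestion bit in their headers, and a sender that halves its rate when it sees either signal.

    from collections import deque

    QUEUE_LIMIT = 8

    class Router:
        def __init__(self, mark_instead_of_drop=False):
            self.queue = deque()
            self.mark = mark_instead_of_drop

        def receive(self, packet):
            if len(self.queue) >= QUEUE_LIMIT:
                if not self.mark:
                    return False                         # queue full: packet dropped
                packet["header"]["congestion"] = True    # ECN-style header bit flip
            self.queue.append(packet)
            return True

    class Sender:
        def __init__(self):
            self.rate = 16          # packets per round

        def on_feedback(self, congested):
            if congested:
                self.rate = max(1, self.rate // 2)   # cut the transmission rate in half
            else:
                self.rate += 1                       # slowly ratchet it back up

    sender, router = Sender(), Router(mark_instead_of_drop=True)
    packet = {"header": {"src": "A", "dst": "B", "congestion": False}, "payload": b"..."}
    delivered = router.receive(packet)
    sender.on_feedback(packet["header"]["congestion"] or not delivered)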

Speedy simulation

When a server on the real network wants to transmit data, it sends a request to the emulator, which sends a dummy packet over a virtual network governed by the new protocol. When the dummy packet reaches its destination, the emulator tells the real server that it can go ahead and send its real packet. If, while passing through the virtual network, a dummy packet has some of its header bits flipped, the real server flips the corresponding bits in the real packet before sending it. If a clogged router on the virtual network drops a dummy packet, the corresponding real packet is never sent. And if, on the virtual network, a higher-priority dummy packet reaches a router after a lower-priority packet but jumps ahead of it in the queue, then on the real network, the higher-priority packet is sent first.
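
A conceptual sketch of that handshake (made-up functions, not the researchers' actual code): the real sender submits a dummy copy of the header, the virtual network decides its fate, and the real packet is then sent, sent with flipped header bits, or suppressed.

    import random

    def virtual_network(dummy_header):
        # Toy stand-in for the emulated routers running the new protocol.
        outcome = random.choice(["deliver", "mark", "drop"])
        if outcome == "mark":
            dummy_header = dict(dummy_header, congestion=True)
        return outcome, dummy_header

    def send_via_emulator(real_packet, transmit):
        outcome, new_header = virtual_network(dict(real_packet["header"]))
        if outcome == "drop":
            return None                        # the real packet is never sent
        real_packet["header"] = new_header     # copy any flipped header bits
        return transmit(real_packet)           # only now does the real transmission happen

    sent = send_via_emulator(
        {"header": {"src": "A", "dst": "B", "congestion": False}, "payload": b"data"},
        transmit=lambda pkt: pkt)
    print(sent)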

The servers on the network thus see the same packets in the same sequence that they would if the real routers were running the new protocol. There’s a slight delay between the first request issued by the first server and the first transmission instruction issued by the emulator. But thereafter, the servers issue packets at normal network speeds. The ability to use real servers running real web applications offers a significant advantage over another popular technique for testing new network management schemes: software simulation, which generally uses statistical patterns to characterize the applications’ behavior in a computationally efficient manner.

 

Taming Big Data

Big Data can be a beast. Data volumes are growing exponentially. The types of data being created are likewise proliferating. And the speed at which data is being created – and the need to analyze it in near real time to derive value from it – is increasing with each passing hour.

But Big Data can be tamed. We’ve got living proof. Thanks to new approaches for processing, storing and analyzing massive volumes of multi-structured data – such as Hadoop and MPP analytic databases – enterprises of all types are uncovering new and valuable insights from Big Data every day.

Leading the way are Web giants like Facebook, LinkedIn and Amazon. Following close behind are early adopters in financial services, healthcare and media. And now it’s your turn. From marketing campaign analysis and social graph analysis to network monitoring, fraud detection and risk modeling, there’s unquestionably a Big Data use case out there with your company’s name on it.

In an era where Big Data can greatly impact a broad population, many novel opportunities arise, chief among them the ability to integrate data from diverse sources and “wrangle” it to extract novel insights. Conceived as a tool that can help both expert and non-expert users better understand public data, MATTERS was collaboratively developed by the Massachusetts High Tech Council, WPI and other institutions as an analytic platform offering dynamic modeling capabilities. MATTERS is an integrative data source on high-fidelity cost and talent competitiveness metrics. Its goal is to extract, integrate and model rich economic, financial, educational and technological information from renowned heterogeneous web data sources, ranging from the US Census Bureau and the Bureau of Labor Statistics to the Institute of Education Sciences, all covering factors known to be critical influences on the economic competitiveness of states. This demonstration of MATTERS illustrates how we tackle the challenges of data acquisition, cleaning, integration and wrangling into appropriate representations, visualization, and storytelling with data in the context of state competitiveness in the high-tech sector.

Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.
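
A small illustration of the difference, with made-up data: the same fact read directly from a fielded record versus pulled out of free text with a pattern that can easily miss phrasing variants.

    import re

    structured = {"invoice_date": "2017-04-12", "amount": 1250.00}
    print(structured["invoice_date"])        # direct, unambiguous field access

    unstructured = "Thanks! We received your payment of $1,250 on April 12th, 2017."
    match = re.search(r"on (\w+ \d{1,2}(?:st|nd|rd|th)?,? \d{4})", unstructured)
    print(match.group(1) if match else "date not found")   # brittle text extraction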

Data in greater volume, velocity and variety has some business leaders riding a big data analytics tiger in search of new commercial opportunities. Now, several years into the big data era, some taming of the tiger seems to be in order. That’s according to several data professionals from large enterprises on hand at IBM’s recent Information On Demand (IOD) conference in Las Vegas. While they see potential for a new breed of data-driven applications, they also see a need to rein in unbridled efforts, which means applying more rigorous planning, refining analytics skills and instituting more data governance.

Data variety and ‘dangerous insights’

Telecommunications giant Verizon Wireless, headquartered in New York, has always had data volume and velocity to deal with. What’s new is the variety of data that big data analytics must work with, said Ksenija Draskovic, a Verizon predictive analytics and data science manager, who discussed the implications for predictive analytics in another IOD session.

C-suite buy-in of the big data kind

Madhavan said he and his JP Morgan Chase colleagues have worked to create better planning methodologies to deal with big data. Steps are in place to ensure business users have an idea beforehand of the kind of data they want to work with, what business goals they hope to achieve, and what kind of revenue can be expected if the new application is wildly successful.

Big Data

Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy.

Lately, the term “big data” tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. “There is little doubt that the quantities of data now available are indeed large, but that’s not the most relevant characteristic of this new data ecosystem.” Analysis of data sets can find new correlations to “spot business trends, prevent diseases, combat crime and so on.” Scientists, business executives, practitioners of medicine, advertising and governments alike regularly meet difficulties with large data-sets in areas including Internet search, fintech, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, biology and environmental research.

Relational database management systems and desktop statistics and visualization packages often have difficulty handling big data. The work may require “massively parallel software running on tens, hundreds, or even thousands of servers”. What counts as “big data” varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. “For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.”
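
A laptop-scale sketch of that split/apply/combine idea using Python's multiprocessing module; real big-data systems such as Hadoop or MPP databases apply the same pattern across many machines rather than a few local processes.

    from collections import Counter
    from multiprocessing import Pool

    def count_words(lines):
        counts = Counter()
        for line in lines:
            counts.update(line.split())
        return counts

    if __name__ == "__main__":
        lines = ["big data big volume", "data velocity", "data variety"] * 1000
        chunks = [lines[i::4] for i in range(4)]        # split the work four ways
        with Pool(4) as pool:
            partials = pool.map(count_words, chunks)    # map step, run in parallel
        total = sum(partials, Counter())                # reduce step
        print(total.most_common(3))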

  • The Brain’s On-Ramps
When facing a challenging scientific problem, researchers often turn to supercomputers. These powerful machines crunch large amounts of data and, with the right software, spin out images that make the data easier to understand. Advanced computational methods and technologies, for instance, can provide unprecedented maps of the human brain, with colors representing the white matter pathways that allow distant parts of the brain to communicate with each other. For over 30 years, the National Science Foundation has invested in high-performance computing, both pushing the frontiers of advanced computing hardware and software and providing access to supercomputers for researchers across a range of disciplines. Use of NSF-supported research cyberinfrastructure resources is at an all-time high and continues to increase across all science and engineering disciplines.

  • Stellar Turbulence
Simulations help astrophysicists understand and model the turbulent mixing of star gases. This image, created at the Pittsburgh Supercomputing Center (PSC), depicts a 3-D mixing layer between two fluids of different densities in a gravitational field. In this case, a heavy gas is on top of a lighter one. This type of mixing plays an essential role in stellar convection. Understanding mixing dynamics will help researchers with a long-term goal of visualizing the turbulent flows of an entire giant star, one similar to the sun. PSC is a leading partner in NSF’s eXtreme Science and Engineering Discovery Environment (XSEDE), which provides researchers and educators with the most powerful collection of advanced digital resources in the world.

  • Solar Fury
This glowing maelstrom results from magnetic arcs (orange lines) that shoot hundreds of thousands of kilometers above the sun’s surface. When the electrically charged arcs destabilize, they cause plasma to erupt from the sun’s surface. If one eruption follows another, the second will spew forth faster than the first, catching up and merging with it. Researchers at the National Center for Supercomputing Applications’ Advanced Visualization Laboratory are simulating events that trigger solar eruptions to aid prediction of future solar storms. Such advances will improve preparations to mitigate and prevent the storms’ worst effects, such as knocking out the electric power grid and disrupting satellite communications.

  • Churning Out a Supernova
This 3-D simulation captures the dynamics that lead to the explosive birth of a supernova. The red colors indicate hot, chaotic material and the blue colors show cold, inert material. Two giant polar lobes form when strongly magnetized material ejected from the star’s center distorts and fails to launch cleanly away. The simulation was created on Stampede, a supercomputer at the Texas Advanced Computing Center.
  • Spinning Hydrogen
On the bridge of the Allosphere, one of the largest immersive scientific instruments in the world, researchers interact with a spinning hydrogen atom. The bridge runs through the center of the spherical display, which includes stereo video projectors covering the entire visual field, immersive audio, and devices to sense, track and engage users. Located at the University of California, Santa Barbara, the Allosphere allows researchers to visualize, explore and evaluate scientific data too small to see and hear. By magnifying the information to the human scale, researchers can better analyze the data to gain new insights into challenging problems.